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Abstract 

The  protdn  folding  problem  involves  the  prediction  of  the  secondary  and  tertiary 
structure  of  a  molecular  ^stem  given  the  primary  structure.  The  prinuuy  structure  defines 
the  sequoice  of  amino-add  residues,  while  the  secondary  structure  describes  the  local  3- 
dimensional  arrangemrat  of  amino-add  reddues  within  the  molecule.  The  relative 
orientation  of  the  secondary  structural  motifit,  namdy  the  totiary  structure,  defines  the 
duq)e  of  the  entire  biomolecule.  The  exact  mechanian  by  which  such  a  sequence  of  amino 
adds  (protdn)  folds  into  its  3-dimaisional  conformation  is  unknown.  Current  iq>proadies 
to  the  protdn  foldit^  problem  include  calculus4)ased  methods,  systematic  search,  modd 
building  and  symbolic  methods,  random  methods  such  as  Monte  Cario  simulation  and 
simulated  annealing,  distance  geometry,  and  molecular  dynamics. 

Many  of  these  current  iqrproaches  search  for  conformations  wfaidi  minimize  the 
potential  energy  of  the  molecule.  Calcuhis-based  techniques  iiKlude  steqiest  descent, 
coi^gate-gradient,  and  Newton-Raphson  methods  to  minimize  a  dassical  energy  fiinction 
of  a  molecular  structure.  The  performance  of  calcuhis-based  search,  however,  is 
susceptible  to  the  many  local  minima  vdiidi  exist  in  sudi  enogy  fimctions.  Systraiatic 
search  operates  by  enumerating  all  possible  disoetized  conformations  for  a  q)ecified 
number  of  d^rees  of  fipeedom.  The  esqxinential  order  of  this  technique  makes  it 
impractical  for  larger  molecules  vdiere  the  nundier  of  d^rees  of  ficedom  can  range  fi-om  a 
few  hundred  to  thousands.  Modd  building  and  symbolic  qiproaches  enqiloy  manual 
interaction  to  define  and  manipulate  the  initial  3-dimaidonal  structure  of  a  molecule  and 
then  a  local  optimization  technique  to  r^ne  H.  Monte  Carlo  emulation  and  amulated 
annealing  are  stochastic  soni-optimization  techniques  used  to  sample  the  search  space  for 
the  lowest  enogy  conformations.  Distance  ^metry  methods  use  an  Hmitive  process  of 
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constraint  satisfaction  to  reduce  the  upper  and  lower  bounds  of  inter  atomic  distances 
within  the  molecule.  The  resulting  tighter  constraints  allow  a  local  minimization  technique 
to  find  the  final  conformation.  A  genetic  algorithm  (GA),  a  stochastic  search  technique 
modeled  after  natural  adaptive  systems,  potentially  ofiers  significant  speedup  over  other 
search  algorithms  because  of  its  inherent  parallelizability. 

The  protein  folding  problem  is  represented  in  a  GA  as  a  population  of  strings 
which  encode  a  possible  conformation  of  the  molecule  bdng  minimized.  The  GA  uses  an 
onpirical  calculation  of  the  intonal  energy  of  the  molecule  as  a  fitness  function  to 
determine  which  population  members  survive  into  the  next  generation.  Many  parallel 
implementations  of  traditional  search  techniques  distribute  the  calculation  of  the  energy 
function  across  the  processors  which  requires  communication  of  the  results  to  determine 
the  overall  energy.  This  workload  distribution  strategy  often  limits  the  scalability  of  the 
algorithm  due  to  the  fixed  problem  size  of  the  energy  function.  However,  a  parallel  GA 
can  distribute  the  population  members  evenly  across  the  processors.  Each  processor 
calculates  the  internal  energy  (fitness)  of  each  of  its  population  members  independent  of 
the  other  processors.  Only  minimal  communications  between  processors  is  necessary  in 
order  to  manage  the  overall  population. 

The  results  of  applying  a  GA  to  the  protein  folding  problem  show  significant 
improvement  in  execution  time  when  compared  to  serial  implementations  of  the  GA.  In 
addition,  the  parallel  GA  demonstrates  good  scalability  characteristics  since  the 
communications  strategy  used  to  mam^e  the  population  can  be  tailored  to  the  parallel 
architecture. 
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TO  THE 

PROTEIN  FOLDING  PROBLEM 


I.  Introduction 

As  computers  become  foster  and  fosto*,  the  size  of  problons  which  are  tackled 
using  computers  become  larger  and  more  complex.  Computers  of  today  are  able  to  work 
with  large  amounts  of  data,  manipulate  this  data,  and  solve  difiBcult  problems  using 
sophisticated  algorithms.  Con^mter  de«gners  have,  however,  begun  to  iq)proach  the 
physical  limitations  of  electronic  conqxrnoits,  effectively  bounding  the  computational 
speed  of  single  processors  used  to  solve  complex  problems  (DeC^ama,  1989).  Recently, 
designs  for  high  performance  conqruters  have  tended  toward  multiple  processor  systems. 
The  goal  of  this  new  type  of  architecture  is  to  have  multiple  processors  working  in  parallel 
to  speedup  the  overall  throughput  of  the  system.  Ideally,  the  effect  is  a  machine  with  an 
overall  performance  capability  directly  proportional  to  the  sum  of  the  performance  of  tlw 
processors  which  make  up  the  system.  However,  the  fuU  potential  speedup  offered  by 
parallel  architectures  is  often  never  rerdized,  primarily  because  it  is  difficult  to 
simultaneously  keep  all  processors  busy  doing  useful  work.  Some  overhead  is  associated 
with  communications  between  processors  which  results  in  a  drain  on  the  performance  of 
parallel  architectures.  To  take  advantage  of  the  speedup  offered  by  parallel  architectures, 
algorithms  must  be  designed  to  perform  actions  rimultaneously,  and  with  certain  level  of 
independence.  Only  when  the  algorithm  is  able  to  divide  the  workload  up  evenly  among 
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all  the  processors,  with  minimal  int^rocessor  depradoicies,  is  the  system  able  to  achieve 
close  to  maximum  performance. 

However,  even  with  the  order  of  magnitude  increases  in  hardware  performance, 
there  remain  problems  whose  solution  >vith  ordinary  algorithmic  approaches  remain 
unsolvable  within  a  reasonable  amount  of  time.  These  intractable  problems  grow 
exponentially,  resulting  in  execution  times  on  the  order  of  years  or  centuries,  even  with  the 
fastest  hardware  available.  One  such  problem  of  international  interest  is  called  the  protein 
folding  problem. 

The  typical  solution  to  solving  intractable  problems  in  a  finite  amount  of  time  is  to 
relax  the  solution  constraint  to  accept  near  optimal  or  semi-optimal  solutions  rather  than 
the  guaranteed  globally  optimal  solution  (Cormoi,  1990).  Many  search-based  algorithms 
are  available  which  provide  good  approximations  to  the  global  solution  and  offer 
reasonable  execution  times  (Goldberg,  1989).  However,  these  algorithms  do  not 
necessarily  always  yield  good  results.  Ofioi,  they  are  problem  spedfic  and  are 
consequently  deceived  by  iq)plication  problems  outade  of  their  intended  problem  domain. 

Characteristics  ^ch  can  be  used  to  evaluate  algorithms  include  speed  of 
execution,  robustness  of  applications,  and  quality  of  solutions.  To  avoid  unreasonable 
execution  times,  for  problems  of  even  modonte  size,  the  order  of  complexity  of  the 
algorithm  must  not  exceed  low  degree  polynomial  ordn*.  Often,  exponential  order 
programs  are  used  despite  the  long  execution  times,  amply  because  there  is  no  currently 
available  algorithm  which  can  guarantee  betto*  results  in  less  time.  The  limitation  on  this 
approach  requires  the  user  to  be  constantly  aware  of  the  algorithm's  time  complexity  to 
avoid  problem  sizes  which  result  in  unacceptable  execution  times.  This  limitation  often 
leads  the  user  to  ask,  "What  is  a  moderate  aze  problem?"  To  help  understand  the  impact 
of  an  exponentially  increasing  fimction,  consider  the  following  example.  A  computer 
capable  of  evaluating  10  trillion  solutions  per  second  would  only  be  able  to  enumerate  the 
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search  space  of  a  2"  exponential  problem  up  to  a  size  of  n  =  55  before  the  execution  time 
exceeds  an  hour.  Solving  the  same  problem  for  a  size  of  n  =  60  would  take  more  than  a 
day  of  constant  execution,  and  for  a  problem  of  size  n  =  70,  the  computer  would  have  to 
be  left  running  for  close  to  four  years.  It  becomes  obvious  that  for  problems  of  relativdy 
moderate  size,  a  complete  enumerative  type  algorithm  is  not  a  feasible  option  for  solving 
exponential  order  problems.  Consequently,  advances  in  hardware  alone  will  never  achieve 
the  desired  speeds  necessary  to  solve  complex  problems  of  large  size,  without 
corresponding  improvements  in  algorithms. 

1.1  Semi>Optimai  Algorithms 

A  good  algorithm  should  be  desigmd  not  only  to  be  of  polynomial  order,  but 
should  also  be  applicable  to  a  variety  of  problems.  A  robust  algorithm  would  allow  a 
computer  user  to  apply  one  algorithm  to  any  of  a  variety  of  problems,  and  achieve  good 
results.  Algorithms  tailored  to  specific  problems  usually  have  good  or  even  excellent 
performance  within  their  limited  problem  domain.  However,  these  problem  specific 
algorithms  typically  have  poor  performance  outade  of  their  intended  application  (Pearl, 
1984).  Consequently,  computer  science  journals  are  filled  with  algorithms  which  have 
been  shown  to  perform  well  for  specific  well  defined  tasks.  Unfortunately,  there  is 
essentially  a  void  of  algorithms  which  ofier  acceptable  performance  across  a  wide 
spectrum  of  applications.  Specifically  tailored  algorithms  obtain  their  higher  performance 
in  terms  of  speed  and  solution  quality  by  making  use  of  previous  knowledge  of  the 
problem  domain  to  guide  the  search  in  the  most  efiSdent  manner. 

The  quality  of  a  solution  is  an  important  fiunor  to  semi-optimal  search  algorithms 
since  by  thdr  nature,  they  do  not  enumerate  the  entire  search  space.  Therefore,  they  can 
not  guarantee  the  solution  which  was  found  is  the  globally  optimal  solution.  Complete 
enumeration  algorithms  guarantee  the  globally  optimal  solution,  but  at  the  cost  of 
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exploring,  either  explicitly  or  implicitly,  the  oitire  search  space  (Brassard,  1988).  For 
exponential  or  combinatoric  problems,  exploring  every  elonmt  in  the  search  space  is 
simply  not  possible  within  a  reasonable  amount  of  time.  The  only  alternative  is  to  apply 
some  strategy  to  limit  the  search  space  to  a  reasonable  size  which  can  thoi  be  mplored. 

1.2  Genetic  Algorithms 

Thus,  it  is  desirable  for  an  algorithm  to  be  of  polynomial  time  complexity, 
applicable  to  a  wide  variety  of  problems,  achieve  optimal  or  near  optimal  results,  and  also 
run  efficiently  on  parallel  machines.  Genetic  algorithms  are  a  relatively  new  type  of 
algorithm  which  have  shown  strong  potential  for  meeting  these  objectives  (Goldberg, 
1989).  Genetic  algorithms  are  search  based  semi-optimization  algorithms  which  are 
similar  in  some  respects  to  random  search  based  algorithms.  The  important  difference, 
however,  is  genetic  algorithms  use  a  stochastic  sampling  procedure  which  results  in  a 
more  directed  search  strategy  than  a  purely  random  search.  Genetic  algorithms  are  of 
polynomial  time  complexity,  and  have  been  shown  to  achieve  good  results  in  a  variety  of 
applications.  Modeled  after  the  natural  evolution  process  of  selection,  mating,  and 
mutation,  genetic  algorithms  simulate  the  Darwin  theory  of  survival  of  the  fittest.  The 
search  space  is  represented  by  a  population  of  strings  upon  which  genetic  operators  act  to 
create  new  generations  of  strings.  Starting  with  an  initial  population,  the  genetic 
operators  use  information  about  the  fitness  of  the  old  population  to  create  a  new  and 
better  population  with  each  successive  generation.  The  solution  to  the  problem 
corresponds  to  the  string  representation  which  possesses  the  highest  fitness  after  an 
appropriate  number  of  evolutionary  cycles.  Since  populations  can  be  easily  divided  into 
sub-populations,  genetic  algorithms  are  also  inherently  parallelizable. 

As  an  example,  of  how  a  genetic  algorithm  could  be  used  as  a  function  optimizer, 
consider  the  following  function,  known  as  Rosenbrock's  Saddle.  The  function  is 
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charactoized  by  a  non-linear,  non-conv«c  sh^  which  is  difficult  to  optimize  using  a 


simple  hill-climbing  type  algorithm. 

In  order  to  apply  the  genetic  algorithm,  the  search  space  must  be  parameterized 
and  encoded  as  a  string.  Consider  a  domain  of  the  fiinction  from  -2.00  to  2.00  with  a 
desired  resolution  of  0.01  for  each  variable.  This  would  require  the  search  space  to  be 
discretized  to  at  least  400  individual  points  per  variable.  Using  a  binary  representation,  an 
encoding  of  9  bits  is  sufficient  to  represent  512  individual  points.  Therefore,  the  genetic 
string  can  be  represented  by  a  sequence  of  18  binary  characters,  where  the  first  half  of  the 
string  represents  the  domain  of  variable  x  and  the  second  half  of  the  string  represents  the 
domain  of y.  Now,  any  point  (x,  y)  can  be  represented  in  the  genetic  algorithm  as  a  single 
14  digit  binary  string,  where  the  values  of  x  and  y  are  equal  to  the  range  of  x  and  y  times 
the  quotient  of  the  integer  representation  of  the  genetic  string  divided  by  the  int^er  range 
of  the  genetic  string  imnus  the  zero  offset.  For  example,  the  values  of  x  and  y  in  Figure  1 
are  represented  by  the  binaiy  equivalent  of 358  (256  +  64  +  32  +  4  +  2)  and  234  (128  +  64 
+  32  +  8  +  2).  The  corresponding  real  values  are  found  by  multiplying  4.0  (the  range  of  x 
and  y)  by  the  quotient  of  358  and  234  respectively,  divided  by  512,  and  then  subtracting 
2.0  for  the  zero  offset. 

The  second  step  is  to  implement  the  means  to  evaluate  the  fitness  of  each  member 


String  Representation 

of  the  domain  of  f(x,y) 

10  110  0  110 

0  1110  10  10 

x=4(358/512)-2=0.80 

y=4(234/512)-2=-0.17 

Figure  1.  Example  Encoding  for  a  Genetic  Algorithm  String 
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of  a  population  of  strings  which  represent  candidate  solutions  to  the  function  being 
optimized.  Assuming  the  function  can  be  evaluated,  the  fimess  value  of  each  population 
member  is  simply  the  value  of  the  function  evaluated  at  the  point  (x,  y)  represented  by  the 
binary  string.  Through  the  searching  action  of  the  genetic  algorithm,  the  population  of 
strings  evolves  to  conform  to  better  and  better  solutions,  eventually  finding  the  optimal  or 
near  optimal  solution  to  the  function. 

Despite  all  the  advantages  offered  by  goietic  algorithms,  they  do  have  some 
limitations  which  lead  to  p^omumce  shortfidls  when  compared  to  more  traditional  search 
algorithms.  Three  primary  deficiencies  of  genetic  algorithms  are  premature  convergence 
on  local  optima,  lack  of  a  stopping  condition,  and  low  precision  (Eshelman,  1993;  Palmer, 
1991;  Reynolds,  1990).  These  deficiencies  are  addressed  in  more  detail  in  the  literature 
review. 

1.3  An  introduction  to  the  Protein  Foiding  Probiem 

The  protein  folding  problem  has  the  distinction  of  being  among  the  national  grand 
challenge  problems.  Simply  stated,  the  protein  folding  problem  consists  of  predicting  the 
3-dimensional  structure  of  a  protein,  given  the  sequence  of  amino-acids  which  make  up 
the  protein.  The  implications  of  "solving”  the  protdn  folding  problem  are  enormous. 
Knowing  the  3-dimensional  structure  of  a  protein  would  allow  biochemists  to  design  new 
medicines  with  very  specific  properties,  thus  minimi^g  negative  side  effects.  Engineers 
could  use  knowledge  of  the  3-dimensional  structure  of  proteins  to  develop  new  materials 
designed  with  unique  physical  qualities.  Finally,  the  genetic  information  encoded  within 
our  own  DNA  molecules  may  provide  the  answers  to  cure  disease,  birth  defects,  and  other 
genetic  abnormalities. 

Currently,  the  only  reliable  methods  available  to  determine  the  3-dimensional 
structures  of  proteins  is  X-ray  crystallography  and  Nuclear  Magnetic  Resonance  (NMR). 
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However,  both  of  these  procedures  require  an  extensive  amount  of  lab  time  in  order  to 
determine  the  structure  of  one  protein.  The  number  of  known  protein  sequences  already 
outnumbers  the  number  of  known  protein  structures  by  nx>re  than  two  orders  of 
magnitude  (Lengauer,  1993).  Given  the  currem  rate  of  discovery  of  new  proteins,  laigdy 
due  to  the  woilc  done  in  the  Human  Genome  Project  (US  Congress,  1988),  this  g^  is 
expected  to  continue  to  widen  until  a  rdiable  and  efiSdent  method  is  found  to  accurately 
predict  the  3-dimensioiial  structure  of  a  proton  given  the  primary  sequence  of  amino*acids 
which  make  up  the  protein.  Even  though  a  prediction  algorithm  is  not  as  definitive  as  the 
actual  determination  of  a  protein's  structure  through  NMR  or  X-ray  crystallography,  if  it  is 
able  to  demonstrate  a  reasonable  level  of  accuracy  at  predicting  the  structures  of  already 
known  proton  sequences,  then  there  will  be  a  corre^nding  level  of  confidence  in  its 
ability  to  accurately  predict  the  structures  of  new  proton  sequences. 

The  absolute  solution  to  the  proton  folding  problem  can  be  modeled  by  two 
classical  approaches  (Lengauer,  1993).  The  first  approach  uses  an  ah  initio  energy 
minimization  rqiproach  which  accounts  for  all  quantum-mechanical  effects  between  atoms 
of  the  molecule;  an  0(n^)  operation  for  each  conformation.  The  final  conformation 
corresponding  to  the  minimum  energy  state,  is  then  assumed  to  be  the  natural 
3-dimensional  structure  of  the  protein.  The  problem  then  becomes,  how  to  effidently 
search  the  conformational  space  for  the  lowest  energy  structure.  A  complete  enumeration 
of  the  search  space  is  impossible,  yet  without  explidtly  or  implicitly  search  the  entire 
search  space,  the  global  minimum  structure  can  not  be  guaranteed. 

The  second  approach  calculates  the  molecular  forces  acting  on  each  atom  of  the 
molecule  and  applies  Newtonian  laws  of  motion  to  predict  the  trajectories  of  each  atom  as 
an  N-body  simulation.  The  greatest  drawback  to  this  approach  is  the  required  time  steps, 
in  order  to  accurately  simulate  motion  at  the  atomic  level,  are  much  too  small  to  be  able  to 
simulate  the  entire  folding  of  a  protein  in  our  life  time.  Consequently,  simplifications  or 
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approxiinations  necessary  in  order  to  increase  the  size  of  the  time  step.  If  the  locations  of 
each  of  the  atoms  near  their  globally  mininuim  conformation  could  be  iy)proximated,  then 
each  of  the  classical  approaches  would  be  well  suited  to  refine  the  final  conformation. 
However,  given  an  unknown  sequence  of  amino-adds  for  which  no  structural  information 
is  available,  a  classical  approach  to  structure  prediction  becomes  intractable  for  even 
moderately  sized  molecules. 

1.4  Parallel  vs.  Serial  Computation 

The  primary  obstacle  to  scaling  to  larger  problem  sizes  for  complex  problems  is 
the  execution  time  usually  exceeds  an  acceptable  level.  Several  hours  or  even  several  day 
of  computation  time,  may  be  an  acceptable  amount  of  time  depending  on  the  importance 
of  the  application.  However,  an  execution  time  of  several  years  is  generally  infeasible 
fi-om  both  a  cost  perspective  as  well  as  a  hardware  reliability  standpoint,  no  matter  how 
important  the  application.  As  has  already  been  stated,  the  processing  speed  of  a  single 
processor  is  currently  approaching  the  physical  limitations  of  the  device.  Consequently,  a 
considerable  effort  is  now  being  expended  on  parallelizing  existing  algorithms  in  the  hopes 
of  achieving  speedup  in  execution  times,  thus  allowing  the  algorithm  to  scale  to  larger 
problem  sizes  (Chandy,  1989). 

Their  inherent  parallelizability  of  genetic  algorithms  is  partially  responsible  for  the 
increasing  popularity.  Several  models  have  been  developed  to  map  the  genetic  algorithm 
to  parallel  architectures  (Gordon,  1993).  The  particular  modd  used  depends  largely  on 
the  type  of  parallel  architecture.  For  distributed  memory  MIMD  (multiple  instruction 
multiple  data)  architectures,  an  island  modd  usually  works  the  best,  where  as  a  cellular 
model  is  conddered  to  be  the  most  qrpropriate  for  fine-grained,  massively  parallel 
arclutectures  (Doiigo,  1993;  Gordon,  1993). 
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1.5  Objective 

The  objective  of  this  research  eff<Mt  is  two  fold.  The  first  objective  is  to 
demonstrate  that  genetic  algorithms  can  be  successfully  applied  to  the  protein  folding 
problem.  The  second  objective  is  to  demonstrate  the  scalability  of  a  paraUd 
implementation  of  a  genetic  algorithm  as  it  is  apf^ed  to  the  protein  folding  problem.  The 
first  objective  involves  devdoping  an  accurate  modd  of  the  potential  oietgy  function  of  a 
molecule  and  incorporating  as  the  fitness  functitm  of  the  goietic  algorithm.  The  second 
objective  requires  the  incorporation  of  the  potoitial  energy  function  into  a  parallel 
implementation  of  a  genetic  algorithm  and  executing  the  code  on  various  numbers  of 
processors  to  collect  speedup  information. 

1.6  Assumptions 

•  The  available  AFTT  software  for  simple  and  messy  parallel  genetic  algorithms  is 
assumed  to  function  properly. 

•  Already  existing  performance  data,  wdiich  is  used  for  comparative  analysis  is 
assumed  to  be  obtained  firom  algorithm  inq>lanentations  ^ch  have  beat 
optimized  for  maximum  performance. 

1.7  Scope 

The  scope  of  this  project  includes  deagning  and  implementing  the  empirical 
potential  energy  function  of  a  molecule.  The  integration  of  this  energy  model  as  the 
fitness  function  of  a  sequential  and  a  paralld  implementation  of  a  anq>le  genetic  algorithm 
is  also  covered.  Comparisons  are  made  with  results  fix)m  published  studies  of  a  simulated 
annealing  iq)proach.  The  problem  domain  conasts  of  the  multi>minima  problon  of  finding 
the  minimum  energy  structure  of  the  polypeptide  [Met]-Enkephalin.  Application  results 
are  analyzed  for  both  serial  and  parallel  implementations  of  the  genetic  algorithm. 
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1.8  StaiKtanto 

The  following  quantifiable  measures  are  used  as  a  minimum  to  conqMre  the  results 
of  the  inq^ementations  of  the  genetic  algorithms,  as  wdl  as  results  fit>m  non-genetk 
qiproaches  to  nmilar  problems. 

•  Number  of  generations  to  find  global  optinaim 

•  Execution  time 

•  Error,  difference  between  global  (q)timum  and  solution  fixmd 

•  Space  considerations 

1.9  Approach/Mttthodology 

1)  Study  literature  pertaining  to  goietic  algorithms,  their  applications,  advantages, 
linutations,  and  current  state  of  performance. 

2)  Study  literature  pertaining  to  the  coitformational  aiudysis  of  proteins,  induding 
general  background,  genetic  algorithm  approaches,  and  other  current  approaches. 

3)  Comprehend  and  analyze  the  genetic  algorithm  software  available. 

4)  Perform  analysis  and  design  of  empirical  potential  energy  fimction. 

5)  Perform  analyds  and  design  of  molecular  representation  schnne  to  encode  the 
conformation  of  a  molecule  as  a  biiuuy  string. 

6)  Inqrlement  enqrirical  potential  oiergy  fimction,  aixl  int^rate  iitto  existing  code  for 
simple  and  parallel  sinqrle  genetic  algorithms,  uang  the  sequential  version  of  tlw 
genetic  algorithm  as  a  develop  nent  environment. 

7)  Implement  representation  scheme  for  a  molecular  conformation. 

8)  Validate  the  potentiai  energy  model  by  conq>aring  calculated  en^es  with  those 
obtained  using  the  molecular  simulation  software  package  called  CHARMm. 
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9)  Test  both  the  siiiq>Ie  and  pandld  single  genetic  algorithnis  against  an  energy 
minimization  problem  and  analyze  results  to  determine  the  trideofib  between  the 
two  algorithms. 

10)  Compare  resuhs  with  published  rq)orts  for  simulated  annealing  approadi  and 
poform  a  quantified  comparison  between  the  performance  of  the  two  iq^roaches. 

1 1)  Summarize  results  and  discuss  pros  and  cons  of  the  two  algorithms.  Discuss  the 
robustness  of  the  algorithms  with  respect  to  the  potential  for  scaling  to  latga* 
parallel  systems  and  larger  problem  sizes. 

12)  Discuss  direction  and  recommendations  fi>r  fimire  work. 

1.10  Materials  and  Equipment 

All  development  is  accomplished  on  AFTT  woricstations  and  the  AFTT  iPSC/2 
Hypercube.  Additional  indght  into  the  effect  of  parallelization  of  the  algorithm  is  gained 
by  porting  the  code  over  to  larger  systems,  including  the  Intel  iPSC/860  Hypercube 
conqniter  in  Beaverton,  Or^on.  Running  the  algorithms  on  various  size  machines 
provides  an  opportunity  to  observe  the  effects  of  scaling,  and  workload  balancing  of  the 
algorithms. 

1.11  Summary 

Computer  hardware  is  fast  approaching  physical  limitations  which  bound  the 
amount  of  computational  powo*  available  fi'om  a  :>ingle  processor.  Yet,  computers  are 
seen  as  essential  tools  to  solve  many  problems  of  exponential  complexity,  such  as  the 
protdn  folding  problem.  For  such  problons,  a  completely  guaranteed  solution  my  not  be 
posnble,  given  their  enormous  complexity.  Consequently,  good  qrproximation 
algorithms,  iqrplicable  to  a  large  variety  of  intractable  problems,  are  necessary  in  order  to 
reduce  the  overall  complexity  of  finding  at  least  a  good  solution.  These  algorithnis  should 
also  be  highly  paralldizable  in  order  to  take  advantage  of  the  full  qieedup  offered  by 
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paraUel  processors.  Genedc  algoriduiis  are  viewed  as  a  potentially  promising  teclmique 
for  applications  involving  exponential  order  problems. 

1.12  Layout  Of  ThMis 

This  first  chi^rter  has  already  introduced  the  topics  of  gemtic  algoritluns.  the 
protein  fol^ng  proUem,  and  paralld  procesang.  Chapter  n  is  a  literature  review 
containing  badcground  information  on  genetic  algorithms,  current  limitations  of  them,  and 
finally,  a  review  of  current  apfHoadies  to  paralldizing  genetic  algorithms.  Cluqner  in 
ccmtains  a  summaiy  of  current  iqjivoaches  to  conformafimud  analysis,  an  analysis  of  the 
potential  energy  minimization  ^}|»t)adi  to  the  inotein  folding  proUem,  as  wdl  as  a 
discussion  of  encoding  schones  to  lefnesent  the  conformational  search  space  of  a  protein. 
Clu^)ter  IV  details  the  experimental  deagn  and  inqdementation  of  the  potential  energy 
fiinction,  the  representation  of  the  problon,  and  the  integration  of  the  design  imo  the 
genetic  algorithm.  Charter  V  provides  the  results  obtained  fiom  the  genetic  algorithm 
applied  to  the  conformational  aiulysis  of  the  protein  [Met]-Enkq)halin.  Finally,  duqHer 
VI  concludes  with  some  final  obsovations  and  recommendations  fi>r  future  rdated  efforts. 
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11.  Literaiure  Review  of  Genetic  Algorithms 


Although  the  c<mceptofgeneticalgonthins  is  relatively  new  (HoUand,  197S),  there 
is  a  growing  foundation  of  literature  devoted  to  the  theory  of  evohitkMiafy  a^orithms  and 
their  ^>plication  to  a  variety  of  problem  domains.  The  first  intematkmal  conference  on 
genetic  algorithms  was  hdd  in  1985,  and  has  since  contributed  significantly  toward  the 
promotion  of  evohititmary  computation.  The  proceedings  of  these  conferences  (novide  a 
wealth  of  information  potaining  to  both  theory  and  ^rfdication.  The  fiidlowing  sections 
provide  a  brief  review  of  rdevant  background  information,  a  discussion  of  current 
limhationa,  and  a  summary  of  various  approadres  to  paralldize  genetic  algorithms. 

2.1  BackgrourKl 

The  priinaiy  distinction  between  various  types  of  computer-based  seardi 
algmithms  b  the  manner  in  winch  the  |»oUem  ^>ace  is  ejqdored.  Traditional  search 
algorithms  can  be  categmized  as  either  calcuhis-based,  enumerative,  or  random 
(Goldberg,  1989).  Calculus-based  algorithms  sedc  to  either  iixlirectly  derive  the  optimal 
solution  by  solving  a  s^  of  equations,  or  directly  find  the  maximum  of  the  function  by 
moving  in  the  direction  of  the  highest  slope.  An  emimerative  algorithm,  Driudi  is  only 
suitable  for  finite  domains,  simidy  emimerates  or  tests  every  pi^  within  the  proUem 
domain.  A  random  search  techiuque  is  rimilar  to  emimerafion;  however,  only  random 
points  within  the  domain  are  tested.  Gmietic  algorithms  are  seardi  based  semi¬ 
optimization  algorithms  which  are  similar  in  some  reflects  to  random  search  based 
algorithms.  The  important  difference,  however,  is  genetic  algorithms  use  a  stochastic 
sampling  procedure  vhich  results  in  a  more  directed  seardi  strategy  than  a  purdy  random 
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search.  Genetic  algorithms  have  been  shown  to  achieve  good  resuhs  in  a  variety  of 
applications  including  combinatorial  minimization,  such  as  tlM  traveling  salesnum  problem 
(Homaifitf,  1993;  Jog,  1989),  the  set  covering  prd>lem  (Sen,  1993),  and  functional 
optimization  (Goldberg,  1987;  Janikow,  1990;  Muselli,  1992). 

2.1.1  General  Deecripdon.  The  basic  genetic  algorithm  conrists  of  at  least  three 
operators — reproduction,  crossover,  and  mutation,  modded  after  the  natural  evolutionary 
processes  of  selection,  mating,  and  mutation  respectively  (Goldberg,  1989;  Holland, 
1975).  Genetic  algorithms  rimulate  the  Darwin  theory  of  survival  of  the  fittest  by 
representing  the  search  space  as  a  population  of  strings  upon  which  genetic  operators  act 
to  create  new  generations  of  strings.  The  application  domain  is  encoded  as  a  string 
structure  representative  of  chromosomes.  Through  the  application  of  selection  opo^Uors 
to  an  initial  population  of  string  structures,  a  new  geno^on  is  formed  called  offspring. 
Fitness  values  are  calculated  for  each  member  of  the  population  by  applying  a  fitness 
fimction  to  each  individual  string  of  the  populafion.  The  strings  with  the  highest  fitness 
values  wiU  be  kept  for  hreedir^  the  next  generation.  The  basic  pseudo  code  for  a  genetic 
algorithm  is  presented  below. 

initialize  population 

calculate  fitness  for  all  members  of  the  population 
for  i  =  1  to  max_number_of_generations 
for  j  =  1  to  population_size 
crossover 
mutation 
evaluate  fitness 
end  loop 
reproduction 
end  loop 


Reproduction  employs  a  selection  strategy  to  determine  vdiich  strings  will  survive 
or  be  copied  over  into  the  next  generation.  The  next  generation  then  is  made  up  of  copies 
of  strings  with  high  fitness  values  which  forms  the  mating  pool  for  the  following 
generation.  The  crossover  operator  mates  a  pair  of  strings  by  randomly  selecting  a 
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Figure  2.  Example  of  Single  Point  Crossover 


crossover  point  along  the  length  of  the  string  and  swi4>ping  the  string  sequence  from  the 
crossover  point  to  the  end  of  the  string  (see  Figure  2).  The  purpose  of  the  crossover 
operator  is  to  create  new  strings  using  pieces  of  the  strings  from  the  previous  generation. 
The  goal  then  is  to  combine  the  good  parts  of  one  string  with  the  good  parts  of  another 
string  to  form  a  new  string  which  is  better  than  each  of  the  individual  parcmts.  The 
mutation  operator  randomly  selects  a  string  within  the  population  and  alters  part  of  the 
string  (see  Figure  3).  Just  as  in  nature,  the  mutation  operator  is  applied  only  poiodically. 
By  applying  the  genetic  algorithm  operators  to  an  initial  population  and  simulating  the 
process  of  natural  selection  over  many  generations,  a  final  population  of  the  fittest  strings 
is  formed.  The  fitud  solution  found  by  the  genetic  algorithm  corresponds  to  the  string 
representation  which  possesses  the  highest  fitness  after  an  appropriate  number  of 
evolutionary  cycles. 

2.1.2  Conqtlejdty  Anafym.  The  genetic  algorithm  is  of  polynomial  order 
complexity  with  a  finite  space  requirement  determined  by  the  population  size.  Although 
Goldberg  (1989)  suggests  that  there  is  an  optimal  population  size  which  depends  upon  the 
length  of  the  string,  there  is  no  fixed  dependence  between  the  length  of  the  string  and  the 
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Figure  3.  Example  of  a  Mutation 

population  size.  Experimental  evidmce,  howevo-,  suggests  that  an  insufficioit  population 
aze  may  adversely  affect  solution  quality  (Goldberg,  1989  and  Meride,  1992).  The  terms 
in  the  order  of  the  genetic  algorithm  reflect  the  length  of  the  string  as  wdl  as  tl^  number 
of  strings  in  the  population.  An  entire  cycle  of  the  genetic  algorithm  is  executed  up  to  a 
maximum  number  of  generations  specified  by  the  uso*. 

The  various  genetic  operators  each  have  associated  minimum  complexities, 
although  the  actual  complexity  depends  on  the  implementation.  The  crossover  operator 
selects  two  strings  fi-om  the  population  pool  of  n  strings,  picks  a  random  location  along 
the  length  of  the  string  of  /  bits,  and  then  swqss  the  two  tails  of  the  parent  strings  vdiich 
follow  the  randomly  selected  crossover  point.  The  number  of  crossovers  per  generation  is 
0(/i),  and  the  actual  crossover  operation  can  be  contidered  to  be  0(0  since  it  needs  to 
traverse  the  length  of  the  string. 

The  fitness  function  includes  a  call  to  decode  the  string  representation  into  the 
value  in  the  problem  domain.  This  decode  call  could  be  an  0(1)  or  an  0(0  function, 
depending  upon  the  string  representation  scheme  and  the  progranuning  environmoit 
capabilities.  Evaluating  the  fitness  function  is  an  0(1)  operation,  vnth  respect  to  the 
populations  size,  since  it  simply  substitutes  the  decoded  string  values,  and  evaluates  the 
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objective  function.  However,  for  complex  problems,  the  evaluation  of  the  fitness  has  a 
lower  bound  of  the  order  of  the  objective  function — the  fiinction  being  optimized. 

Mutation  also  could  be  an  0(1)  or  an  0(n)  operation  depending  upon  the 
particular  implemmitation  and  also  the  mutation  strata  being  used.  Studies  have  shown 
good  results  are  obtained  with  a  mutation  rate  of  once  for  every  thousand  string  position 
transfers  (Goldberg,  1989). 

The  selection  function  has  an  0(n)  complexity  since  it  must  evaluate  each  member 
of  the  population  to  determine  which  strings  will  be  carried  on  to  the  next  gmeration. 
Actual  implementations  of  the  selection  operator  may  be  of  0(n^)  complexity,  such  as  a 
roulette  wheel  q)proach  biased  according  to  the  fitness  of  the  strings. 

2.LJ  Niche  Theory.  While  optimizing  over  a  search  space  which  contains  many 
near  optimal  solutions,  ^ple  genetic  algorithms  have  been  found  to  be  susceptible  to 
genetic  drift  (De  Jong,  1975;  Goldberg  &  Segrest,  1987).  Genetic  drift  happens  when  the 
stochastic  error  associated  with  the  genetic  opmtors,  in  conjunction  with  competing 
schemata  corresponding  to  different  peaks  within  the  search  space,  causes  the  population 
to  converge  to  one  of  the  peaks  (Deb,  1989).  To  overcome  this  pitfall,  several 
implementations  of  genetic  algorithms  have  been  developed  which  attempt  to  maintain 
multiple,  stable  subpopulations  on  different  peaks  within  the  search  space  (Dd>,  1989; 
Goldberg,  1989). 

The  theoretical  foundation  for  maintaining  separate  subpopulations  is  based  on  the 
limited  resoivces  model  which  can  be  observed  among  natural  organisms.  In  nature, 
organisms  rarely  compete  for  the  same  resource;  rather,  th^  coexist  while  exploiting 
different  resources  within  the  same  environment.  The  relative  uze  of  the  population  of  a 
species  corresponds  to  the  availability  of  the  resource  which  it  exploits.  The  balance  of 
the  number  of  individuals  among  any  one  species  is  maintained  by  the  fact  that  all 
resources  are  limited;  if  a  resource  becomes  over  utilized,  starvation  will  correct  the 
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imbalance  until  an  equilibrium  is  reached.  The  resulting  system  produces  stable 
subpopulations  of  species  which  exploit  thdr  own  region  or  niche  within  the  overall 
environment. 

The  analogy  between  natural  systems  and  goietic  algorithms  demonstrates  tlM 
benefits  gained  by  allowing  the  search  space  to  be  explored  simultaneously  within  sevo^ 
different  niches  of  the  search  space.  The  number  of  population  members  allocated  to  each 
niche  is  proportional  to  the  relative  fitness  of  the  niche,  corresponding  to  the  availability  of 
the  resource.  The  balance  of  the  number  of  individuals  allocated  to  the  various  niches  is 
maintained  throughout  the  search  process  in  two  ways.  The  relative  fitness  of  one  niche 
can  decline  with  respect  to  other  niches  which  prove  to  yield  better  solutions,  thus  cau^g 
a  proportionally  smaUer  number  of  individuals  to  be  allocated  to  the  niche  with  lesser 
performance.  The  competition  for  limited  resources  also  affects  the  probability  of  survival 
of  any  one  individual.  If  a  large  portion  of  the  population  is  exploiting  one  area  of  the 
search  space,  the  survival  rate  of  any  one  individual  will  be  proportionally  smaller,  even 
though  the  area  of  exploitation  may  correspond  to  a  niche  of  high  fitness.  This 
competition  for  limited  resources  provides  the  genetic  algorithm  with  the  mechanism  to 
prevent  the  population  fi'om  converging  upon  one  solution. 

2.1.4  Modek  and  Niche  Formation.  There  are  basically  two  methods  for 
incorporating  a  niche  strategy  in  genetic  algorithms.  The  first,  proposed  by  De  Jong 
(1975),  is  called  the  crowding  scheme  which  implements  a  generalized  form  of 
preselection  to  determine  which  population  member  is  replaced  by  a  new  offspring.  The 
idea  is  to  replace  an  individual  with  a  similar  individual  of  higher  fitness,  thus  preserving 
population  diversity.  Similarity  is  based  on  a  bit-by-bit  comparison  of  two  strings  to 
determine  the  number  of  bits  in  common.  The  crowding  scheme  works  by  preventing 
members  of  one  niche  from  dominating  the  members  of  another  niche. 
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The  second  method  used  to  create  niches  in  the  search  space  is  called  the  sharing 
scheme  (Deb,  1989;  Goldberg  &  Richardson,  1987).  The  sharing  scheme  acts  as  a  fitness 
scaling  factor.  The  fitness  of  individuals  are  scaled  according  to  the  number  of  other 
similar  individuals  which  considered  to  be  exploiting  the  same  region  or  niche  within  the 
search  space.  In  this  scheme  a  distance  metric  is  used  to  scale  the  fitness  of  an  individual 
such  that  its  expected  survival  rate  is  proportional  to  the  number  and  proximity  of 
neighboring  individuals.  Consequently,  >\4ien  a  large  number  of  individuals  inhabit  the 
same  region  of  the  search  space,  their  corresponding  fitness  will  be  reduced,  resulting  in  a 
proportionally  smaller  number  of  individuals  to  be  allocated  to  that  region  of  the  search 
space.  Likewise,  when  an  individual  of  relatively  low  fitness  inhabits  a  sparsely  populated 
region  of  the  search  space,  its  fitness  will  be  scaled  up,  thus  increasing  it  chances  of 
survival. 

The  source  of  the  distance  metric  used  in  the  sharing  scheme  has  lead  to  two 
different  implementations.  The  distance  used  to  quantify  the  proximity  of  neighboring 
individuals  can  be  determined  at  either  the  phenotypic  level  or  the  genotypic  level.  The 
phenotypic  level  refers  to  the  decoded  parameter  search  space.  For  a  problem  with  n 
parameters,  the  distance  is  usually  calculated  as  the  Euclidean  distance  between  two  points 
in  n-dimensional  space.  The  distance  at  the  genotypic  level  simply  corresponds  to  the 
Hamming  distance  between  two  strings  of  the  population. 

2.2  Limitations  of  Genetic  Aigorithms 

Despite  all  the  advantages  offered  by  genetic  algorithms,  they  do  have  some 
limitations  which  lead  to  performance  shortfalls  when  compared  to  more  traditional  search 
algorithms  such  as  A*,  hill-climbing,  depth-first,  or  breadth-first  search  (Rich,  1991). 
Three  primary  deficiencies  of  genetic  algorithms  are  premature  convergence  on  local 
optima,  lack  of  a  solution  based  stopping  condition,  and  low  precision.  Some  of  these 
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deficiencies  may  be  rectifiable  through  the  use  of  dynamically  sdf-adjusting  genetic 
algorithms  (Aizawa,  1993;  Davis,  1989;  Lee,  1993). 

Currently  there  are  implementations  of  genetic  algorithms  ^ch  incorporate,  at 
least  to  some  d^ree,  a  dynamic  factor  to  adjust  one  or  more  of  the  evolutionary  features 
of  the  goietic  algorithm  as  the  population  evolves.  The  following  sections  focus  on 
current  implementations  of  genetic  algorithms  which  incorporate  a  dynamic  element  within 
their  genetic  operators  in  order  to  adapt  die  search  strata  to  overcome  obstacles  during 
various  stages  of  the  evolutionary  search  process. 

2.2,1  Premature  Convargeuce.  Pronature  convergence  h^pens  i^idien  a  search 
through  a  search  space  becomes  tnqjped  in  a  locally  optimal  solution  state  and  is  unable  to 
locate  the  globally  optimal  solution.  This  problem  is  particularly  important  for 
applications  where  many  local  minima  odst,  such  that  finding  the  globally  optimal 
minimum  becomes  exceedingly  difiBcuh  without  exploring  the  entire  search  space. 
Although  other  algorithms  are  just  as  susceptible  to  this  phenomenon,  it  may  be  possible 
to  avoid  premature  convergence  with  genetic  algorithms  by  modifying  the  way  in  ^ch 
locally  optimal  states  are  explored. 

In  order  for  genetic  algorithms  to  gain  wider  acceptance  as  a  valid  and  reliable 
approach  to  solving  real-world  problems,  efforts  must  be  made  to  reduce  the  likelihood  of 
premature  convergence,  and  to  increase  tlw  level  of  precision.  The  properties  and 
characteristics  of  genetic  algorithms  are  still  not  fully  understood.  Consequently,  the  exact 
causes  of  premature  convergence  are  unknown.  Studies  have  indicated  evidence  to 
surest  there  may  be  a  connection  between  population  size  and  the  probability  of 
premature  convergence  (Merkle,  1992;  Moed,  1991).  Although  these  studies  are  not 
conclusive,  and  the  data  tq)pears  to  be  somewhat  problem  dependent,  there  is  a  direct 
relationship  between  population  size  and  the  tradeoff  between  exploration  and  exploitation 
(Moed,  1991).  Convergence  upon  a  local  optima  can  have  a  profound  impact  upon 
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overall  solution  quality.  A  string  of  unusually  hi^  fitness,  found  early  in  the  seardi 
process  can  quickly  dominate  the  population  as  a  resuh  of  the  proportional  allocation  of 
the  selection  operator.  In  a  parallel  processing  environment,  migration  strat^es  can  also 
accelerate  premature  convo^ence  by  spreading  locally  optimal  strings  to  other  sub¬ 
populations  (M^kle,  1992). 

In  an  implementation  by  Reynolds  (1990),  the  genetic  algorithm  was  modified  to 
include  a  dynamic  fitness  scaling  fiurtor.  Soding  factors  are  often  used  to  reduce  the 
fitness  of  a  string  in  order  to  promote  the  continuation  of  the  search  (Buckles,  1990, 
Reynolds,  1990).  Reynolds  justified  the  use  of  a  scaling  factor  with  the  assumption  that 
no  single  individual  can  represent  the  solution,  rather  the  population  must  evolve  to 
contain  an  entire  generation  of  highly  fit  strings.  LeGrand  and  Merz  (1993)  chose  a 
different  selection  operator  altogether,  based  on  rank  rather  than  fitness.  This  rank  based 
approach  is  believed  to  reduce  the  problems  assodated  with  a  string  of  unusually  high 
fitness  being  asdgned  a  disproportionate  amount  of  the  population  ^ace. 

A  variation  of  genetic  algorithms,  call<  4l  messy  genetic  algorithms,  have  been 
developed  in  part  to  overcome  problems  of  premature  convergence  (Goldberg,  Korb,  & 
Dd),  1989;  Goldberg,  1990;  Goldbei^  1991).  The  messy  genetic  algorithm  is  based  on 
an  assumption  that  the  deceptiveness  of  the  problem  is  bounded.  A  problem  is  considered 
deceptive  if  short,  low-order  building  blocks  lead  to  suboptimal  higher  order  building 
blocks  (Goldberg,  1989).  One  phase  of  the  messy  genetic  algorithm,  called  the  primordial 
phase,  is  designed  to  find  the  building  blocks  with  the  highest  fitness  values  which  will 
presumably  be  in  the  solution.  The  next  phase,  called  the  Juxtapositional  phase,  evaluates 
the  various  combinations  of  building  blocks  to  determine  the  combination  which  yields  the 
highest  solution  quality.  Messy  genetic  algorithms  are  more  complicated  than  simple 
genetic  algorithms,  requiring  both  more  time  and  more  memory.  Also,  the  primordial 
phase,  which  constitutes  the  majority  of  the  ^ecution  time,  is  inherently  non-parallelizable 
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on  distributed  memory  architectures;  however,  approximations  of  the  primordial  phase  do 
exist,  which  can  be  paralldized  (Merkle,  1992). 

2.2.2  SUqtping  Condition.  Users  of  genetic  algorithms  are  forced  to  qpedfy  an 
artificial  stopping  condition  corresponding  to  the  desired  number  of  genovtions  to  create. 
VTith  each  successive  generation,  the  solution  quality  of  the  population  evolves  to  higher 
and  higher  levels  of  fitness.  There  is,  however,  no  obvious  point  at  vdiich  the  algorithm 
completes,  or  otherwise  indicates  that  no  fiirfiier  generations  are  necessary.  Typically,  the 
user  specifies  some  arbitrary  numbo'  of  generations  to  perform,  after  vdiich  the  final  or 
best  population  found  is  accepted  as  the  solution.  It  is  difficult  to  estimate  the  number  of 
generations  required  to  evolve  a  solution  of  the  derired  quality;  consequently,  the  number 
of  generations  is  usually  set  much  higher  than  the  actual  number  needed.  The  trade-off 
which  exists  is  between  solution  quality  and  execution  time.  An  insufficient  number  of 
generations  may  result  in  a  poor  solution  if  it  has  not  evolved  long  enough;  excess 
generations  waste  computer  time  by  perfomung  evolutionary  cycles  which  do  not  result  in 
improved  solutions.  Since  the  algorithm  is  of  polynomial  order  time  complexity,  the 
tradeoff  usuaUy  favors  wasting  computer  time  to  ensure  an  adequate  number  of 
evolutionary  cycles  are  performed.  However,  even  a  conservative  iq>proximation  based  on 
previous  observation  does  not  guarantee  an  optimal  stopping  point. 

There  have  been  implementations  of  genetic  algorithms  which  facilitate  a  ample 
automatic  stopping  condition  which  allows  the  algorithm  to  cycle  through  the  evolutionary 
process,  creating  new  generations  until  a  threshold  number  of  cycles  have  passed  in  vdiich 
no  new  optimal  solution  has  been  found.  The  implementation  by  Reynolds  (1990) 
contained  an  exit  condition  which  terminated  the  genetic  algorithm  after  a  ^>ecified 
number  of  generations  where  all  members  of  the  population  maintained  a  specified  fitness 
level.  In  other  words,  as  the  algorithm  proceeds,  the  overall  fitness  of  the  population 
increases,  until  a  point  where  the  population  contains  only  strings  of  high  fitness. 
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Furthermore,  to  maintain  a  high  fitness  value  for  a  number  of  generations,  any 
recombination's  of  strings  within  the  population  must  also  be  of  a  high  fitness.  Assuming 
the  best  solution  found  so  fin  is  being  stored  and  updated  throughout  the  genetic 
algorithm,  a  simple  way  to  fiunlitate  an  automatic  stopping  condition  is  to  allow  the 
algorithm  to  cycle  through  the  evolutionary  process,  creating  new  goierations  until  a 
threshold  number  of  cycles  have  passed  in  which  no  new  optinud  solution  has  been  found. 
LeGrand  and  Merz  (1993)  used  a  multiple  exit  criteria  to  terminate  the  search  when  any  of 
the  following  conditions  were  met:  100,000  goierations  without  finding  a  imw  solution  to 
replace  the  previously  best  solution,  variance  of  the  population  fitness  falls  bdow  0.1,  or 
the  distance  between  200  randomly  selected  p^  &lls  below  a  specified  amount.  The 
second  and  third  of  these  exit  criteria,  like  tlm  strategy  used  by  Reynolds,  exploits  some 
measure  of  the  population  convergence  to  generate  a  terminating  condition. 

12.3  Preci^H.  A  final  limitation  of  genetic  algorithms  is  the  lack  of  precision  in 
exploitir^  a  minimum  state.  Often,  a  genefic  algorithm  is  able  to  locate  the  general 
location  of  a  local  or  global  minimum,  but  the  nature  of  the  genetic  operators  make  it 
difficult  to  narrow  in  on  the  specific  minimum  of  the  surface  or  function  (Janikow,  1991). 
A  typical  approach  to  this  problem  is  to  use  genetic  algorithms  to  locate  the  vicinity  of 
minimum  states  to  limit  the  search  space  and  then  apply  a  different  algorithm,  such  as  a 
gradient  based  algorithm,  to  perform  the  fine  optimization. 

To  improve  precision,  genetic  algorithms  have  been  used  in  conjunction  with 
gradient-based  algorithms  to  maintain  the  robustness  of  genetic  algorithms,  yet  gain  the 
fine  precision  ef  a  directed  search  algorithm.  A  study  by  Janikow  and  Michalewicz  (1990) 
implemented  a  specialized  genetic  algorithm  able  to  achieve  precision  by  combining  a 
floating  point  representation  scheme  with  what  was  referred  to  as  a  dynamic  mutation 
rate,  which  decreased  with  time.  The  effect  of  the  adjustable  mutation  rate  was  a  more 
localized  search  with  each  successive  generation.  Janikow  and  Michalewicz  argued  a 
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decreasing  rate  of  mutation  causes  the  population  to  be  searched  very  locally  at  lata* 
stages  of  the  evolutionary  process.  Similarly,  Atmar  (1990)  claimed  decreeing  the 
mutation  rate  stabilized  the  genetic  information,  thus  avoiding  the  dimiptive  nature  of 
mutation  on  an  already  highly  fit  string.  TIm  resuhs  of  experiments  performed  by  Janikow 
and  Michalewicz  show  thm  specialized  genetic  algorithm  outperforms  clastical 
i^)proaches  in  terms  of  both  precision  and  the  required  number  of  generations.  They  also 
found  the  degree  of  predtion  achieved  by  thdr  implementation  compared  fiivorably  to 
more  traditional  gradient-based  approaches. 

2.3  Parallel  Genetic  Algorithms 

There  are  a  number  of  strategies  for  moping  of  genetic  algorithms  to  distributed 
memory  architectures.  Selection  of  a  particular  approach  should  take  into  consideration 
certain  features  of  the  target  architecture: 

•  the  time  tcomm  required  for  a  fixed  amount  of  interprocessor  communication, 

•  the  computation  time  lc  required  for  a  single  fitness  fimction  evaluation,  and 

•  the  number  of  available  processors. 

Of  particular  importance  is  the  ratio  Uting  this  ratio,  it  is  possible  to  determine 

the  amount  of  interprocessor  communication  which  may  be  performed  without  sacrificing 
processor  utilization.  Typically,  tcomm  implying  that  very  little  interprocessor 
communication  may  be  performed  without  impacting  execution  time. 

One  strategy,  which  has  very  flexible  interprocessor  communication  requirements, 
is  the  "island"  model,  wherein  the  population  is  partitioned  imo  subpopulations  (Dorigo, 
1993;  Gordon,  1993).  Each  processor  executes  an  independent  genetic  algorithm 
operating  on  a  single  subpopulation.  No  interprocessor  communication  is  required  for 
fitness  evaluation.  Various  global  selection  and/or  migration  strategies  may  be  employed. 
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Experimental  results  indicate  that  tradeofib  exist  between  solution  quality  and  required 
communication  time. 

A  less  common  strata  is  the  "ftmung”  model,  in  which  the  vriiole  p(^xilation 
resides  on  a  single  processor  (Dorigo,  1993).  This  processor  (the  master)  applies  the 
genetic  (iterators  to  the  population,  while  the  remaimng  processors  (the  slaves)  perfixm 
the  fitness  fimction  evahiatioiis.  As  the  master  processor  creates  a  new  individual,  it 
requests  one  of  the  slave  processors  to  evaluate  hs  fitness.  After  evaluating  the  fitness  of 
the  population  niend)er,  the  slave  returns  the  fitness  to  the  master.  Knee  0(n) 
communications  are  necessary  at  every  genaadon,  this  strata  is  best  suited  for 
applications  in  which  tg  ^tcomm- 

Recently,  many  parallelized  versions  of  several  molecular  dyiuunics  programs, 
including  CHARMm,  have  become  available.  This  offers  many  difio’ent  possibilities  for 
decomposing  a  GA  ^)proach  to  the  structure  prediction  problem.  For  exanq>le,  a  single 
global  population  could  be  executed  on  the  host  i^diere  the  evaluation  of  fitness  for  each 
population  member  is  executed  in  paralld  on  the  node  processors.  This  would  in  effect 
speed  up  the  evaluation  cyde  of  the  GA  However,  one  must  consida*  the  added 
communications  required  to  communicate  each  population  meniba*  to  the  iKxle  processors 
for  every  generation. 

Parallel  implonentations  of  genetic  algorithms  provide  a  perfect  setting  for 
encouraging  the  formation  of  niches  within  the  search  qpace.  The  subpopulations  present 
on  each  node  of  a  multiprocessor  environment  pro^de  a  natural  mechanism  to  implidtly 
search  separate  niches  of  the  search  space,  as  opposed  to  explicitly  forcirig  the  formation 
of  niches  within  a  single  population  in  a  serial  genetic  algorithm  implementation. 
However,  some  communication  nuy  be  necessary  between  nodes  in  order  to  prevent  a 
rignificant  amount  of  duplication  between  subpopulations. 
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2.4  SuiraiMury 

Although  h  has  been  almost  twemy  years  since  genetic  algmithms  woe  first 
established  by  John  Holland  (1975)  as  a  valid  tedinique  fi>r  searching  conq>lex  ^>aces,  it 
has  not  been  until  recently  that  attention  has  focused  on  using  geiwtic  algorithms  as 
general  problem  solvers  for  a  wide  variety  of  applications.  Already  genetic  algcnrithms 
have  been  shown  to  achieve  good  solutions  for  solving  compile  business,  sdence,  and 
engineering  problons.  As  the  variety  of  ^>plications  increases,  solving  some  of  the 
current  limitations,  induding  premature  convo^oice,  termination,  and  predsion,  becomes 
all  the  more  important.  Previously,  limitations  in  the  performance  of  genetic  algorithms  in 
a  specific  iq)plication  could  be  compensated  by  the  int^ration  of  domain  q)ecific 
knowledge.  However,  to  allow  genetic  algorithms  to  be  used  as  general  purpose  tools, 
the  strategies  used  in  the  evolution  process  must  be  independent  of  the  iq>plication 
domain. 

The  paralldization  of  genetic  algorithms  potentially  offers  an  attractive  means  for 
achieving  near  linear  q)eedup  using  paralld  processing.  The  ability  to  tailcM-  the 
communication  strategy  of  a  genetic  algorithm  to  the  particular  paralld  architecture, 
allows  the  time  penalty  of  communications  to  be  minimized.  Consequently,  the  most 
natural  method  of  paraUel  decomposition  of  the  genetic  algorithm  is  to  distribute  the  total 
population  as  a  set  of  subpopulations  on  each  node,  where  the  fitness  of  each  member  of 
the  subpopulation  is  calculated  locally.  It  is  the  independence  of  the  subpopulations  wUch 
allows  a  parallel  GA  to  achieve  near  linear  q[)eedup.  Even  though  the  evaluation  of  the 
fitness  may  be  paralldizable,  it  is  unlikely  that  the  interprocessor  communications  required 
to  perform  an  evaluation  in  parallel  could  be  less  than  the  communications  between 
subpopulations. 

As  the  present  undo’standing  of  genetic  algorithms  grows,  knowledge  of  the 
effects  of  genetic  operators  on  solution  quality  and  performance  will  allow  better  search 
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strat^es  to  be  developed.  Currently,  dynamic  opmtors  hold  promise  for  inqjrovii^  the 
performance  of  goietic  algorithms  by  adjusting  thnr  influence  to  reflect  the  stage  of  the 
evolutionary  process.  A  dynamic  sdection  process  may  reduce  the  likdihood  of 
pranature  convergence  on  a  locally  optimal  solution  by  prevoiting  strings  of  unusually 
high  fitness  fi’om  dominating  the  population  early  in  the  search  process.  A  dynamic 
termination  condhimi  may  diminate  the  need  to  q)edfy  an  arbitrary  numbo’  of  genoitions 
by  gauging  the  level  of  convergence  of  the  population.  A  dynamic  mutation  rate  may 
allow  more  precise  solutions  by  reducing  thdr  disruptive  influence  during  the  final 
optimization  stages  of  the  search.  Peiiuq)s  the  biggest  ben^t  of  implementing  dynamic 
operators  is  that  the  algorithm  may  be  able  to  adjust  itself  to  a  wide  variety  rtf'  proUem 
domains,  without  the  intervention  of  a  domain  expert. 
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III.  Anafysis  of  the  Protein  FoUir^  Problem 


A  protdn  consists  of  a  sequence  of  amino  adds.  Each  amino  add  is  identified  by 
an  attached  chemical  structure  called  a  sidechain  (LeGrand,  1993).  The  commcm  pwtion 
of  an  amino  add,  to  which  the  side  duun  or  residue  is  attadied,  forms  the  iMckbone  of  the 
protein  molecule  (see  Figure  4).  The  sequence  of  amino  adds,  linked  together  by  peptide 
bonds,  forms  what  is  referred  to  as  the  primary  structure  of  a  protein.  Aumnc  forces  and 
interactions  between  bonded  atoms  cause  the  primary  structure  to  fold  into  a  three 
dimensional  structure,  known  as  the  tertiary  structure.  Contained  within  the  tertiary 
structure  are  distinguishable  structural  cmrqwnents,  such  as  alpha  hdices  or  beta  sheets, 
^ch  are  called  secondary  structures.  The  final  tertiary  structure,  vdio-e  ail  bonded  aixl 
non-bonded  interactions  are  stable,  is  called  the  molecular  conformation.  The  protein 
folding  ptxfolem  can  be  described  as  a  search  for  the  natural  tertiary  structure  of  a  protein, 
given  its  primary  structure. 
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3.1  Background 

Protons  are  an  integral  part  of  everyday  life.  The  function  of  any  proton  is  largdy 
determined  by  its  tertiary  structure  (Chan,  1993;  Loigauer,  1993).  Fibrous  proteins  are 
found  in  structural  components  of  our  bodies,  v^e  membrane  proteins  mediate  the 
exchange  of  matoials  across  cdhilar  boundaries.  Enzymes,  which  control  virtually  all 
biochemical  reactions  in  living  cdls,  are  globular  protons.  A  considoable  amount  of 
research  today  is  directed  toward  biochemically  engineering  protons  with  desired 
functional  properties.  However,  in  order  to  acoirately  predict  the  functional  propoties  of 
a  protein,  some  knowledge  of  the  tertiary  structure  is  required.  The  development  of  an 
accurate  model  for  predicting  the  tertiary  structures  of  known  proton  sequences  would 
significantly  aid  biochemists  in  their  search  for  protons  with  specific  fiinctiorud  properties. 

Current  technology  provides  a  virtually  automated  means  of  sequencing  a  protein 
to  determine  its  primary  structure.  Consequently,  there  are  about  50,000  known  protein 
sequences  (Bairoch  &  Boeckmann,  1991).  Conversely,  current  technology  such  as  X>ray 
crystallography  and  nuclear-magnetic-resonance  (NMR),  has  only  been  able  to 
experimentally  determine  the  tertiary  structure  of  about  400  proteins  (Bernstein,  et.  al, 
1977).  This  gap  in  structural  information  is  expanding  due  to  the  Hunuui  Genome  Project 
which  at  its  current  rate  is  doublir^  the  number  of  known  proton  sequences  every  year 
(US  Congress,  No.  OTA-BA-373,  1988). 

Another  indication  of  the  scale  of  the  proton  folding  problem  is  the  complexity  of 
the  conformational  search  space.  The  algorithmic  complexity  of  an  optimization  problem 
depends  on  two  factors — the  number  of  variables  in  the  function  to  be  optimized  and  the 
size  of  the  variable  domains.  The  size  of  the  search  ^>ace  is  proportional  to  the  variable 
domain,  but  can  also  be  scaled  by  a  factor  assodated  with  the  derired  precision  of  the 
solution.  To  analyze  a  problem  in  terms  of  computational  complexity,  the  solution  space 
must  be  viewed  as  a  discrete  representation,  since  any  computer  driven  solution  must 
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rq>resent  the  search  space  as  a  discrete  set  of  points.  The  overall  complexity  of  finding 

the  globally  optimal  solution  to  a  problon  tlwn  is  the  time  assodated  with  implicitly  or 

explicitly  searching  the  entire  solution  space.  For  comjriex,  non-linear  problems,  complete 

enumeration  is  the  only  guaranteed  method  for  determining  the  globally  optimal  solution. 

Thus,  the  complexity  of  a  problem  containing  n  variables  denoted  by  x,,  j^,...,x„  would  be 

the  following  product  term. 
n 

0 11^1  II  IKII  is  the  cardinality  of  the  discrete  domain  for  the  variable  x,) 

1=1 

There  exists  a  relationship  between  the  number  of  atoms  in  a  molecule  and  the 
degrees  of  fireedom  associated  with  the  molecular  enorgy.  Each  molecule  has  3n  -  6 
degrees  of  fi’eedom  where  n  is  the  number  of  atoms  contained  in  the  molecule.  This 
relationship  simply  results  fi‘om  assigning  the  location  of  each  atoms  with  respect  to  one  of 
the  molecule's  atoms  chosen  as  the  reference  point.  Each  protein  consists  of  any^ere 
firom  a  few  to  several  thousand  anuno  acids,  and  each  anuno  add  contains  a  number  of 
atoms.  The  resulting  search  space  in  which  a  conformation  can  be  found  is  a  hyperspace 
of  dimensionality  3n  -  6. 

The  exponential  order  of  the  search  space  grows  much  faster  than  2”  or  3”  since 
each  dimension  of  the  search  space  can  be  discretized  to  some  domain  of  possible  values  d 
resulting  in  a  complexity  of  ||d||'^''^  search  space.  Fortunately,  the  bond  lengths  and  bond 
angles  of  a  molecule  are  relatively  stable.  The  prindpal  determinants  of  the  tertiary 
structure  of  a  protein  are  the  dihedral  angles  of  the  molecule  (LeGrand  &  1993). 

Consequently  the  conformational  search  space  of  a  protdn  can  be  reduced  to  Hdll”,  where 
||d||  is  the  number  of  discrete  dihedral  angles  represemed,  and  n  is  the  number  of 
independently  variable  dihedral  angles.  However,  even  for  this  simplified  problem,  a 
complete  enumeration  of  the  total  conformational  search  space  is  infeasible.  For  example, 
consider  a  small  problem  with  only  IS  independently  variable  dihedral  angles  and  a  range 
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of  dihedral  ai^es  from  0  to  360  degrees  discretized  to  20  degree  incronents.  The 
resulting  search  space  is  of  the  order  of  10^^  possible  conformations.  Evoi  using  a 
teraflop  computer  which  evaluates  one  point  tNtxy  clock  cycle,  a  complete  numeration 
technique  would  require  over  78  days  to  search  this  relatively  small  search  space.  Adding 
just  one  more  independent  variable  increases  the  execution  time  to  almost  4  years.  The 
complexity  of  the  problem,  combined  with  the  ever  increasing  rate  at  which  protdn 
sequences  are  being  discovered,  drives  the  need  for  faster  methods  of  determining  the 
tertiary  structures  of  known  protein  sequences.  This  is  the  reason  the  protein  folding 
problem  is  included  as  one  of  the  national  grand  challenges. 

3.2  Current  Methods  of  Structure  Prediction 

There  are  currently  several  approaches  to  finding  the  natural,  three  dimensional 
conformation  of  proteins.  Methods  of  conformational  analysis  include  molecular 
dynamics,  energy  minimization,  homology-based,  and  simplification  techniques. 

3.11  Molecular  Dynandcs.  Perhaps  the  most  natural  approach  is  to  actually 
simulate  the  folding  process.  This  is  the  approach  taken  when  using  molecular  dynamics 
to  model  the  motion  of  atoms  as  an  N-body  simulation,  using  Newtonian  laws  of  motion 
in  response  to  atomic  forces.  The  limiting  factor  to  the  use  of  molecular  dynamics  is  tte 
time  step  required  to  accurately  model  the  motion  of  atoms  as  the  protein  folds.  Thermal 
oscillations  within  a  molecule  require  the  simulation  time  step  to  be  on  the  order  of 
femtoseconds  (10*'^).  However,  the  complete  folding  process  is  on  the  order  of 
milliseconds  or  even  seconds  (Lengauer,  1993).  Current  technology  limits  the  simulation 
of  atomic  trajectories  to  time  windows  of  only  a  few  hundred  picoseconds  (10**^);  much 
too  short  to  determine  the  final  folded  tertiary  structure. 

3.2.2  Energy  Minimization.  Energy  minimization  methods  are  based  on  the 
observation  that  physical  systems  tend  toward  a  minimum  energy  state.  Whether  the 
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natural  conformation  of  a  protein  necessarily  coincides  with  the  global  minimum  energy 
state  of  the  molecule  is  still  undetermined.  Nonethdess,  the  natural  conformation  of  a 
protein  must  coincide  with  at  least  a  locally  minimum  energy  state  in  order  to  be  stable. 
Thus,  the  search  for  the  natural  conformation  equates  to  a  functional  optimization  problon 
where  the  fiinction  to  be  minimized  is  a  model  of  the  energy  of  the  protein  with  a  specific 
three  dimensional  structure.  The  are  two  major  obstacles  which  energy  minimization 
techniques  must  &ce.  The  first  obstacle  is  developing  an  accurate  nuMlel  of  the  energy 
function.  An  exact  model,  accounting  for  all  quantum-mechanical  effects  can  be  an 
expensive  calculation  with  a  time  complexity  as  high  as  0(n^),  where  n  is  the  number  of 
atoms  in  the  molecule  (Lengauer,  1993).  Consequently,  semi-empirical  calculations  are 
usually  used  to  approximate  the  energy  function  as  an  0(n^  operation.  The  second  nuyor 
obstacle  is  to  devise  a  search  algorithm  which  efficiently  samples  the  conformation  search 
space,  yet  avoids  the  multi-minima  problem  of  becoming  trapped  in  local  minima. 

There  are  many  approaches  to  functional  optimization  (Bilbro,  1991).  Differential 
algorithms  use  brute  force  mathematics  to  derive  the  minimum  of  the  function.  This 
approach,  however,  only  works  when  the  function  is  differentiable.  For  non-differentiable 
functions,  gradient-based  or  hill-climbing  algorithms  can  be  used.  A  gradient-based 
approach  uses  a  greedy  type  algorithm  to  direct  the  search  in  the  most  promising 
direction.  The  greedy  algorithm  operates  on  the  basis  of  local  decisions  to  guide  the 
search  toward  the  globally  optimal  solution.  This  approach  works  fine  for  simpler 
functions,  but  does  not  perform  as  well  on  complex  functions  containing  many  minima. 
Consequently,  a  more  robust  search  strategy  must  be  applied  to  avoid  being  trapped  by 
local  minima. 

The  search  space  for  a  function  consists  of  the  domain  of  the  variables  contained  in 
the  function  to  be  optimized.  The  solution  will  be  the  set  of  n  values,  where  n  is  the 
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number  of  variables  in  the  function,  and  y(v„  Vj,  Vj,  v„)  is  either  a  maximum  or 
minimum  solution  depending  on  the  type  of  optimization  bdng  performed. 

Solution  of/x„  Xj,  Xj. ....  x,)  =  [v„  Vj,  Vj, ....  v„] 

The  solution  space  of  the  energy  function  contains  numy  local  minima,  and  is  a 
fundamental  challenge  to  functional  optimization  techniques  to  find  the  globaUy  optimal 
solution.  Various  search  algorithms,  including  Monte  Cario  optimization,  simulated 
annealing,  and  genetic  algorithms,  have  been  used  to  address  the  multi-minima  problem. 

3.2.3  Homology-Based  Approaches.  A  different  type  of  approach  altogether, 
uses  a  more  qualitative  foundation.  The  use  of  homology  to  predict  the  structure  of 
proteins  is  based  on  the  premise  that  molecules  with  similar  primary  structures  will  form 
similar  tertiary  structures  (Lengauer,  1993).  Therefore,  if  the  amino  acid  sequence  of  a 
new  protein  is  found  to  be  similar  to  that  of  a  protein  with  known  tertiary  structure,  then 
based  on  analogy,  the  new  protein  will  possess  a  rimilar  tertiary  structure.  The  obvious 
limitation  of  this  methodology  is  its  inability  to  extrapolate  to  new  proteins  which  differ 
significantly  in  their  prinuiry  structure  fi'om  any  known  protein.  A  variation  on  this  theme 
focuses  on  the  secondary  structures  which  make  up  the  overall  tertiary  structure.  The 
problem  reduces  to  aligning  sequences  of  a  new  protein  to  the  sequences  of  a  protein  with 
known  tertiary  structure.  These  sequences  are  then  assumed  to  correspond  to  the 
secondary  structures  contained  within  the  known  protein.  Although  this  approach  allows 
the  extrapolation  to  conformations  of  unique  and  different  proteins,  it  ignores  the 
interactions  between  secondary  structures  and  their  effects  on  the  overall  tertiary 
structure.  Also,  the  sequence  alignment  problem  is  a  combinatoric  problem  in  itself  to  find 
the  best  match  between  two  amino  acid  sequences. 

3.2.4  Simplification  techniques.  Simplifying  assumptions  have  often  been  used 
to  model  the  complex  behavior  of  large  problems  in  a  feasible  and  representative  manner. 
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The  idea  behind  simplification  is  to  o^ture  tl^  gross  contributions  to  a  complex  problem, 
and  simplify  the  rest  so  as  to  account  for  the  nuyority  of  the  problems  behavior  (Lengauer, 
1993).  A  common  simplification  technique  used  in  conformational  analysis  is  to  restrict 
the  conformation  of  a  protein  to  a  lattice  type  structure  where  the  backbone-carbon  atoms 
(C-alpha's)  are  forced  to  reside  on  the  vertices  of  the  lattice.  The  connections  between 
vertices  of  the  lattice  are  chosen  such  that  they  represent  likely  conformation  angles,  thus 
greatly  reducing  the  conformational  search  space.  Lattice  models  usually  simplify  or  even 
omit  the  modeling  of  the  side  chains  of  the  amino  acids.  Despite  their  many  simplifying 
assumptions,  lattice  models  have  been  fouiul  to  exhibit  protein-like  behavior  when 
modeling  hydrophilic  and  hydrophobic  behavior.  Hydrophilic  proteins  are  polar  or 
charged  and  are  attracted  to  water,  where  as  hydrophobic  proteins  are  oil-like  and  repel 
water  (Chan,  1993). 

3.3  The  Semi-Empirical  Potential  Energy  Function 

The  foundation  for  using  a  semi-empirical  potential  energy  function  is  based  on  a 
tradeoff  between  numerical  accuracy  and  calculation  complexity.  The  semi-empirical 
potential  energy  function  presented  here,  is  derived  firom  extensive  experimental  analysis 
(CHARMm,  1992;  Le  Grand,  1993).  The  parameters  for  the  constant  terms  of  the 
expression  are  also  experimentally  determined.  The  function  takes  into  account  the  non- 
bonded  van  der  Waals  interaction  represented  by  the  Lennard-Jones  potential.  Coulomb's 
law,  and  the  interactions  between  bonded  atoms.  Note  that  for  non-bonded  electrostatic 
interaction,  the  solvent  interaction  between  the  atoms  of  a  molecule  and  the  surrounding 
enxnronment  are  not  included  explicitly. 

The  molecular  energy  equation  is  a  non-linear  function,  whose  parameters  are 
difficult  to  optimize  due  to  the  complex  interactions  taking  place  between  the  atoms  of  a 
molecule  which  the  function  models.  The  simplest  term  of  the  energy  function  includes 
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only  the  non-bonded  interactions  term  of  the  equation.  The  fimction  defined  below  is  the 
Lennard-Jones  potential  which  accounts  for  the  van  der  Waids  attraction  and  repulsion 
energy.  The  last  term  in  the  non-bonded  interactions  fimction  modds  the  dectrostatic 
attraction  and  repulsion  energy. 

F  = 

^non-bonded 

non 

The  terms  and  Bjj  are  empirically  ddermined  constants  for  atoms  i  and  y,  and 
Tij  is  the  distance  between  atoms  /  and  y  measured  in  Angstroms.  The  9/  and  qj  terms 
represent  the  atomic  charges  of  atoms  i  and  y.  The  total  energy  fi-om  non-bonded 
interactions  is  defined  as  a  summation  over  all  pairs  of  atoms  within  the  molecule.  For 
only  two  atoms,  this  function  has  a  very  deterministic  behavior,  but  as  the  molecule  of 
interest  becomes  more  complex,  the  number  of  atomic  interactions  increases  polynomially 
with  each  additional  atom  added  to  the  molecule. 

The  next  step  is  to  add  the  components  of  the  energy  function  which  represent  the 
interactions  along  molecular  bonds.  This  energy  is  represented  by  three  summation  terms 
which  vary  over  the  bond  lengths,  bond  angles,  and  the  dihedral  angles  formed  by  the 
molecular  bonds.  Empirically  derived  constants  are  used  in  the  equation  and  consist  of  a 
leading  constant  coefficient  (K)  associated  t^th  the  types  of  atoms  involved  in  the 
relationship.  An  equilibrium  value  (designated  by  eq)  for  bond  length,  bond  angle  and 
dihedral  angle  is  also  present  and  depends  upon  the  types  of  atoms  in  the  relationship. 
The  individual  terms  of  the  energy  function  are  described  below. 

Bond  length  energy  =  ^  ('(/  “  '’eg 

bonds 

Bond  angle  energy  =  (©,yt  -  0^^ 

angles 


I 

■bonded 


era 
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Dihedral  angle  energy  =  )] 

dihedreUs 


The  total  energy  of  the  molecule  is  equal  to  the  summation  of  the  individual  terms 
which  define  the  bonded  imeractions  plus  the  summation  of  the  non-bonded  imeractions 
terms.  The  total  potential  energy  of  a  molecule  is  defined  as  follows. 


^Total  ”  X  ^eq ) 


bonds 


angles 

Z  ^<t>ijki  [1  +  cos(<D  -  Yiju )]  + 
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3.4  Encoding  of  the  Search  Space 

As  stated  previously,  the  number  of  indq>endent  variables  of  the  energy  fiinction  is 
equal  to  3n-6,  wha«  n  is  the  number  of  atoms  in  the  molecule.  These  independent 
variables  correspond  to  the  bond  lengths  betweo)  bonded  pairs  of  atoms,  the  bond  angles 
formed  by  two  pairs  of  bonded  atoms  with  a  common  atom  fbrmmg  the  vertex  of  the 
angle,  and  the  dihedral  or  tornon  angle  defined  as  the  rotation  angle  along  the  axial  cento- 
(bond)  of  a  chain  of  four  bonded  atoms.  The  bond  lengths  and  bond  an^es  within  a 
molecule  exhibit  relatively  stable  behavior  and  consequently  can  be  accuratdy  modded  as 
a  fixed  variable  based  on  the  atom  types  involved  in  the  bonded  interaction  (Schulze- 
Kremer,  1993).  Tables  are  available  which  pro>nde  empirically  determined  values  for  bond 
lengths  and  bond  angles  for  most  combinations  of  atoms.  The  dihedral  angles  constitute 
the  remaining  independent  variables.  To  furtlwr  refine  the  problem,  attention  can  be 
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focused  on  the  phi,  psi,  and  om^  (O.  T,  and  (d)  angles  (dihedrals  along  tlw  backbone)  of 
the  protein  molecule,  and  the  chi  (x)  angles  of  the  residue  (dihedrals  determining  the 
orientation  of  the  »dechains).  The  renuuning  dihedral  angles  contained  within  the 
chemical  structures  of  the  sidechains  are  held  fixed. 

The  solution  space  for  the  potential  energy  fimction  is  an  n-dimensional 
hyperspace  where  n  is  the  number  of  independently  variable  dihedral  angles.  The  surfiice 
of  the  energy  function  is  characterized  by  many  local  minima,  resulting  in  a  function  which 
is  difficult  to  optimize  (Bilbro  &  Snyder,  1991).  A  ^ple  gradient  based  search  algorithm 
would  likely  become  trapped  in  a  local  minima,  unable  to  proceed  to  find  the  globally 
optimal  solution.  Stochastic  based  search  algorithms  such  as  Monte  Carlo,  simulated 
annealing,  and  genetic  algorithms  have  been  ^own  to  successfiilly  avoid  the  multi-minima 
problem  of  becoming  trapped  in  a  local  minima.  Each  of  these  search  techniques  are  able 
to  use  the  potential  energy  function  to  evaluate  points  in  the  search  space  corresponding 
to  a  specific  conformation  of  the  molecular  structure  bong  minimized.  The  search  space 
consists  of  an  encoding  of  the  independent  variables  necessary  to  fully  specify  the  three 
dimensional  structure  of  the  protein. 

3,4.1  A  Genedc  Algorithm  Encoding  Scheme,  A  method  ^^ch  can  be  used  to 
search  for  the  maximum  or  minimum  of  a  comply  function,  using  genetic  algorithms 
starts  with  constructing  an  encoding  of  the  search  space  as  a  character  string.  This 
involves  two  decisions,  the  cardinality  of  the  alphabet  used  as  the  characters,  c,  and  the 
length  of  the  string,  /.  Both  of  these  decisions  will  affect  the  overall  size  of  the  search 
space  since  the  size  of  the  search  space  is  defined  by  d.  For  functional  optimization,  the 
most  common  string  representation  uses  a  binary  alphabet  to  define  a  string  of  length  /, 
where  /  is  determined  by  the  size  of  the  largest  possible  solution  for  each  of  the  variables 
in  the  function  to  be  optinuzed.  For  example,  a  function  of  two  integer  variables  with  a 
domain  of  0  to  100  could  be  represented  by  a  string  of  length  14,  the  first  seven  of  which 


37 


would  refmsent  the  binary  encoding  of  one  variable,  and  the  second  half  of  the  string 
would  represent  the  other  variable.  Although  this  string  would  actually  represeid  a 
possible  solution  space  of  0  to  127  for  both  of  the  variables,  tl^  simplicity  of  the  string 
representation,  and  the  relatively  small  string  length  make  this  the  most  obvious  and 
practical  choice  of  string  length  and  alphabet.  However,  as  the  size  and  complexity  of  the 
function  to  be  optimized  increases,  a  more  efScient  means  of  representing  tlw  search  space 
may  be  necessary. 

The  grain  of  the  search  space  is  another  consideration  which  may  influence  the 
effectiveness  of  a  representation  schone.  For  example,  if  a  function  contains 
trigonometric  components,  it  may  only  be  feasible  for  solutions  to  exist  on  increments  of  n 
radians,  even  though  the  function  may  be  defined  for  all  real  numbers.  The  grain  of  the 
string  representation  is  very  specific  to  tlw  problem,  and  may  require  a  significant 
understanding  of  the  problem  in  order  to  reduce  the  grain  of  the  representation.  The 
benefit,  however,  of  reducing  the  granularity  of  the  encoded  string,  is  a  much  smaller 
search  space. 

Often,  the  binary  alphabet  is  chosen  since  the  genetic  operators  necessary  for 
crossover  and  mutation  are  easy  to  implement  for  binary  strings.  Also,  the  conversion 
procedure  to  convert  the  encoded  representation  into  the  actual  solution  format  in  order 
to  evaluate  the  objective  function  should  also  be  efficient  since  it  is  executed  often. 
Experiments  by  Janikow  (1990),  using  a  floating  poim  string  representation,  indicate  that 
non-binary  strings  are  also  capable  of  providing  good  results. 


38 


IV  Experimental  Design  and  ImpiemetUatitm 

The  design  of  any  eiqjerinient  requires  a  balance  between  planning  and  a  constant 
awaroiess  of  the  goal  of  the  research  effort.  The  most  carefully  laid  out  eqieriment  is 
meaningless  unless  it  is  able  to  substantiate  the  fundamental  hypothesis  of  the  expoimoit. 
Consequently,  an  experiment  should  be  designed  to  provide  some  quantifiable  measure 
fiom  v^ch  to  perform  statistical  analysis  and  draw  condusitms  (Montgomery,  1976). 

The  following  section  discusses  the  details  of  the  experimental  design;  tte 
inqtlementation  and  validation  of  the  molecular  energy  modd,  the  mt^ration  of  the  nxxld 
into  existing  software,  and  the  program  execution  on  various  hardware  ardiite^ures.  The 
section  condudes  with  a  detailed  presentatitm  of  the  methoddogy  used  to  evaluate  the 
performance  of  tins  GA  ^tproach  to  the  confimnationd  analysis  of  proteins. 

The  rq)proach  used  to  mqmmentally  evaluate  the  performance  of  GA's  for 
conformationd  analysis  involves  a  three  step  process.  The  first  step  is  to  devdq>  an 
accurate  model  of  the  molecular  energy.  Tlw  next  step  is  to  im^rate  the  naodd  into  the 
GA.  Finally,  the  last  step  is  to  design  an  experimoit  winch  evaluates  ibe  poformance  of 
GA's  over  a  variety  of  test  cases. 

4.1  Implementation  of  Molecular  Energy  MocM 

The  molecular  energy  function  is  an  enq^cd  calculation  of  the  molecule's  enogy 
based  on  the  geometry  between  various  sets  of  atoms  within  the  molecule.  Consequently, 
a  data  structure  vdiich  cqatures  the  three  dimensiond  structure  of  the  molecule  is  required 
in  order  to  evduate  this  function.  The  enq>iricd  mergy  function  must  also  have  access  to 
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constant  parameters  which  are  detomined  by  the  types  of  atoms  involved  in  the  eneigy 
term.  Finally,  a  procedure  must  be  devised  to  validate  the  enei:gy  model. 

4,1.1  iHitUUizatioH  €f  Data  Structures.  A  ^steoB  annoadi  to  design  starts  with 
an  analysis  of  the  inputs  and  outputs  of  a  process  to  determine  hs  requirements.  The  input 
61es  used  in  this  researdt  are  generated  by  a  software  padcage  called  QUANTA.  The 
primary  input  to  the  energy  modd  of  a  particular  {Motein  is  the  Z>matrix  rqrresentadon  of 
the  protein.  A  Z-matiix  encodes  the  stnictural  inftmnation  necessary  to  fully  define  the  3- 
dimensional  position  of  each  atom  within  the  molecule.  The  method  in  whidi  this 
structural  information  is  encoded  is  by  defining  eadi  atom's  position  in  terms  of  its  brnid 
length,  bond  angle,  and  dihedral  angle  with  re^rect  to  three  previously  q[>ecified  attMns. 
Generating  a  Z-matrix  with  QUANTA  produces  an  output  file  with  the  following  format 
for  each  line  of  the  file. 

[atom  type]  [bond  length]  [flag]  [bond  angle]  [flag]  [dihedral]  [flag]  [atom J]  [atom_k]  [atomj]  [charge] 

ZHnatiix  Format 

The  Z-matrix  consists  of  a  sequential  listiiig  of  all  the  atoms  present  in  the 
molecule.  Each  line  of  the  Z-matrix  corresponds  to  a  q^edfic  atom  number.  The  atom 
type  is  simply  the  chmiical  symbol  for  the  type  of  atom  presont.  The  bond  length  is  the 
distance  between  the  present  atom  and  the  atom  corresponding  to  the  number  in  the  fidd 
atomj.  The  bond  angle  is  defined  as  the  ai^e  formed  between  the  presort  atom,  atomj, 
and  atom_k.  The  (Uhedral  angle  is  the  twist  or  tordon  angle  alorig  the  central  bond  of  a 
chain  of  four  atoms.  Thus,  the  dihedral  angle  is  defined  as  the  tordon  angle  along  the  axis 
of  the  middle  bond  formed  between  the  present  atom,  atomj,  atom_k,  and  atomj.  The 
flag  fields  present  in  the  Z-matrix  serve  the  purpose  of  identifying  structural  parameters  as 
either  fixed  or  independently  variable.  Although  the  Z-matrix  contains  a  field  for  the 
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charge  of  each  atom,  this  field  is  not  used.  Instead,  QUANTA  generates  a  sqMuate  file,  an 
rtf  file,  which  contains  more  specific  atom  type  infimnation  as  well  as  the  atomic  charges 
corre^KMidii^  to  each  atom  of  the  molecule. 


N 

■.'i'X'X'B 

0 

0 

0.00000 

n 

0 

0 

n 

C 

1.45300 

a 

0 

n 

1 

0 

tl 

C 

1.52877 

il 

107.50007 

0 

0.00000 

0 

2 

1 

0 

C 

1.40904 

0 

111.80808 

0 

-110.07103 

1 

3 

2 

1 

0.000 

C 

1.38714 

0 

120.03789 

0 

89.08887 

1 

4 

3 

2 

0.000 

Example  lines  fiom  a  Z-matrix 


The  above  example  shows  the  first  five  lines  extracted  fiom  a  Z-matrix  for 
[Met]-Enkephalin.  From  this  small  example  the  following  information  can  be  interpreted: 
the  fifth  atom  is  a  carbon  atom,  which  is  1.38714  Angstroms  distant  fiom  the  fourth  atom 
of  the  molecule,  and  forms  a  120.03769  d^ree  angle  between  hself^  the  fourth  and  the 
third  atom,  where  the  fourth  atom  lies  on  the  vertex  of  the  angle.  Likewise,  there  is  a 
torsion  angle  of  89.96887  along  the  bond  between  the  fourth  and  third  atom  with  respect 
to  the  chain  of  atoms  extending  firom  the  fifth  to  the  second  atom.  A  "I"  in  the  flag  fidld  is 
used  to  indicate  an  independoitly  variable  parameter,  while  a  ”0”  indicates  that  the 
parameter  is  hdd  fixed.  The  flags  in  the  example  indicate  the  dihedral  angles  for  the 
fourth  and  fifth  atoms  are  indq)endent  variables. 

In  three  dimenaonal  Cartesian  space,  the  first  atom  of  the  Z-matrix  may  be 
arbitrarily  considered  to  be  at  the  origin.  The  second  atom  may  be  considered  to  be  on  a 
coordinate  axis,  such  that  the  bond  between  the  first  and  second  atoms  lies  on  dther  the  x, 
y,  or  z  axis.  The  position  of  the  third  atom  may  be  conridered  such  that  the  plane  defined 
by  two  unit  vectors  along  the  bonds  to  the  two  previous  atoms  lies  in  dther  the  xy,  yz,  or 
the  xz  plane.  The  Carteaan  coordinates  of  the  fourth  atom,  as  well  as  ar^  subsequent 
atom,  can  then  be  calculated  given  the  distance,  bond  angle,  and  dihedral  angle  between 
that  atom  and  the  three  previously  specified  atomsto  which  it  is  bonded.  Thus,  the  process 
of  calculating  the  Carterian  coordirutes  for  all  atoms  within  the  molecule  requires  the 
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initial  determination  of  the  first  three  atoms  as  reference  points,  and  then  iteratively 
calculating  the  position  of  the  next  atom  with  respect  to  the  previous  three  atoms  to  which 
it  is  bonded.  This  process  includes  rotational  and  translational  transformations  of  the 
Cartesian  coordinates  of  the  inevious  three  atoms,  aligning  them  with  the  origin,  in  order 
to  sin^lify  the  geometry  of  the  problem.  The  Cartesian  coordinates  corre^nding  to  tlw 
atoms  listed  in  the  above  example  are  listed  as  foUows  in  what  is  called  a  PDB  file  format. 
Observe  that  the  first  bond  is  placed  along  the  z  axis  and  the  borul  angle  formed  by  the 
first  three  atoms  is  placed  in  the  xz  plane.  Also  note  the  atom  types  have  changed  to  a 
more  explicit  notation  \^ch  was  read  in  fi-om  another  file  called  the  RTF  file. 


Atom  Type 

X 

y 

z 

ATOM 

NT 

0.000 

0.000 

ATOM 

CT 

0.000 

1.453 

ATOM 

CT 

1.457 

0.000 

1.915 

ATOM 

C6R 

1.773 

1.208 

2.746 

ATOM 

C6R 

2.367 

2.139 

Example  of  PDB  FOe  Format 

The  calculation  of  the  Cartesian  coordinates  of  all  atoms  within  the  molecule  is 
necessary  for  every  evaluation  of  the  energy  function  corresponding  to  a  particular 
conformation.  The  Cartesian  coordinates  are  necessary  to  calculate  the  atomic  distances 
between  atoms  for  the  non-bonded  term  of  the  energy  fimcdon,  as  well  as  to  calculate 
dependent  bond  angles  and  dihedral  angles  which  are  not  explicitly  represemed  in  the 
Z-matrix. 

U^g  the  systems  approach  to  design  the  data  structure  for  the  molecular  energy 
model,  it  follows  that  two  major  categories  of  information  must  be  represemed.  The  input 
to  the  model  consists  of  both  structural  information  to  represent  the  three  dimensional 
conformation  of  the  molecule,  as  well  as  parameter  information  required  to  calculate  the 
components  of  the  energy  function.  The  required  output  is  the  Cartesian  coordinates  of 
each  of  the  atoms  of  the  molecule. 
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The  first  requiremoit  is  to  design  a  data  structure  wdiich  retains  the  structural 
information  present  in  the  Z>matrix.  Additionally,  tl^  data  structure  nnist  include  the 
Cartesian  coordinates  of  each  atom.  The  energy  fiinction  requires  access  to  constant 
parameters  for  each  bonded  term  of  the  energy  function.  These  parameters  are 
determined  by  the  specific  combination  of  atom  types  in  the  bonded  relationship.  The 
number  of  bonded  relationships  which  exist  in  a  molecule  includes  those  ejq^lidtly 
represented  in  the  Z-matrix,  as  well  as  those  bonded  rdationships  \diich  can  be  implicitly 
determined  fix>m  the  Z-matrix.  The  number  of  bonded  rdationships  is  roughly  of  order  n, 
where  n  is  the  number  of  atoms  present  in  the  molecule.  The  number  of  non-bonded  atom 
pairs  is  approximately  n{n-\yi.  The  number  of  atoms  in  the  molecule  is  known  a  priori; 
however,  the  number  of  bonded  and  non-bonded  imeraction  terms  depend  upon  the 
structural  information  present  in  the  Z-matrix.  Consequently,  a  two  level  ^roach  to  the 
data  structure  design  was  implemented.  The  first  level  of  information  pertained  to  the 
specific  information  corresponding  to  each  individual  atom.  The  next  level  of  information 
pertained  to  the  specific  information  required  to  evaluate  each  component  of  the  energy 
fiinction  and  was  generated  fi’om  the  first  level  of  information.  This  approach  resulted  in 
two  distinct  data  structures: 

typedef  struct 
{ 


char  type_of_atom(ATOM_NAME_LENGTH] ;  /*  atom  type  -  C,  N,  etc 

*/ 

double  charge; 

/* 

atomic  charge  of  atom 

*/ 

double  X,  y,  z; 

/* 

Cartesian  coordinates  of  atom 

*/ 

int  atom_j; 

/* 

atom  number  of  direct  parents 

*/ 

double  bond_length; 

/* 

fixed  bond  length  between  i  £  j 

*/ 

int  bond_group; 

/* 

Oi^Fixed,  l=°Independent 

*/ 

int  atom_k; 

/* 

third  atom  defining  bond  angle 

*/ 

double  bond_angle; 

/* 

fixed  bond  angle  between  i,  j  &  k 

*/ 

int  angle_group; 

/* 

0=Fixedr  l^Independent 

*/ 

int  atom_l; 

/* 

fourth  atom  forming  dihedral 

*/ 

double  dlhedral_angle; 

/* 

independent  dihedral  angle 

*/ 

int  dihedral_group; 

/* 

0=Fixed,  l=Independent 

*/ 

)  ATOM  TYPE; 
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typedef  struct  group 
( 

Int  1; 
int  j; 

Int  k; 

Int  1; 

double  kjpam;  !*  constant  paramter  */ 

doiible  bodanglephl;  I*  bond,  angle,  or  phase  parameter 
Int  n_orjphase;  /*  n  or  phase  parameter  *! 

struct  group  *next 
}  GROUP  LIST; 


The  first  data  structure  resembles  the  cmitents  ofthe  Z-matrix,  with  the  additimi  of 
fidds  to  hdd  the  Cartesian  coordinates  of  eadi  atom.  The  entire  molecule  is  re|wesented 
as  an  array  of  records  of  type  ATOMJTYPE.  Any  atom  spedfic  information  can  then  be 
referenced  by  atom  number  and  field,  as  in  aton^i].fidld.  The  second  data  structure  is  a 
dynamic  data  structure  which  contains  the  atom  numbers  and  constant  parameters 
associated  with  a  bonded  or  non-bonded  interaction  torn  of  the  energy  function.  Tltt 
linked  list  structure  can  contain  as  many  records  as  there  are  bonded  arul  tK>n-bonded 
interactions.  Separate  lists  were  created  for  eadi  component  of  the  energy  function,  as 
well  as  idiether  the  structural  geometry  of  the  bond  rdationship  is  fixed,  dependent,  or 
independently  variable.  This  resulted  in  tea  separate  linked  lists,  which  were  instantiated 
as  follows: 

GROUP_LIST  *Non_bond; 

GROUP_LIST  *Fixed_bond; 

GROUP_LIST  *Depend_bond; 

GROUP_LIST  *Indep_bond; 

GROUP_LIST  *Fixed_angle; 

GROUP_LIST  *Depend_angle; 

GROUP_LIST  *Indep_angle; 

GROUP_LIST  *Flxed_dlhedral; 

GROUP_LI ST  *  Depend_dlhedral ; 

GROUP_LIST  *Indep_dihedral; 


The  first  step  in  initializing  the  data  structures  b^ins  by  reading  in  the  Z-matrix  as 
an  array  of  atoms  of  type  ATOM_TyPE.  After  reading  the  Z-matrix  into  the  qrpropriate 
fidds  of  the  array  of  records,  a  function  called  create _groups  is  executed  which  builds  the 
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lists  of  at(Hns  with  boiuied  and  non-bonded  interactions  betweoi  than.  This  is  done  by 
systematically  con^aring  the  indices  of  atoms  to  which  each  atom  is  bonded.  All  bonded 
relationships  idoidfied  in  the  Z-matrix  are  ehha  indepoidently  variable  or  fixed, 
depending  on  the  flag  field  for  each  variable.  AU  bonded  relationships  which  exist,  but  are 
not  explicitly  represented  in  the  Z-matrix  are  identified  and  added  to  the  dqiendent 
variable  lists  and  there  values  are  daermined  by  the  values  of  the  dq)aKlent  and 
independent  variables.  Non-bonded  atom  pairs  are  those  pairs  of  atoms  which  are  not 
involved  in  a  bond,  bond  angle,  or  dihedral  rdationship  with  each  otha.  Aita  each  of  the 
ten  group  lists  are  created,  the  paramaers  associated  with  the  atom  types  and  the  term  of 
the  enagy  function,  are  found  in  a  parameta  file  and  stored  in  the  list  structure.  This 
allows  the  paramaers  from  the  paramaa  file  to  be  read  only  once,  and  makes  all  required 
paramaers  for  each  evaluation  of  the  enagy  fiinction  available  fi’om  this  point  forward. 

4.L2  Parameter  Spedfieatwns  Fite.  The  constant  paramaers  associated  with 
the  various  components  of  the  enagy  function  are  constants  wMch  have  been 
experimentally  derived  fi’om  actual  observed  data.  The  source  of  paramaers  for  this 
research  is  the  paramaa  file  used  by  anotha  softwae  package  called  CHARMm, 
specifically  version  22.0,  last  updated  92/10/05.  This  file  contains  the  constant  paramaers 
associated  with  bond  lengths,  bond  angles,  dihedral  angles,  and  non-bonded  pairs.  The 
format  of  the  file  varies  according  to  the  numba  of  atoms  involved  in  the  interaction; 
howeva,  the  general  format  is  as  follows. 


[aom_type  1]  [aomjype  2]  [aomjype  3]  (aomjtype  4]  [1st  pammaer]  [2rKi  paramaer] 

Paramaa  File  Fomut 

The  third  and  fourth  atom  types  are  only  present  for  bond  angle  and  dihedral  angle 
paramaas,  respeaively.  The  paramaa  file  is  organized  into  subseaions,  with  headings 
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separating  the  different  types  of  parameters  according  to  the  term  of  the  energy  function 
to  which  they  corresponded.  The  order  of  the  groups  starts  with  the  parameters  for  the 
bond  energy  term  of  the  energy  function.  Tho'e  are  two  empirical  parameters  for  bond 
energy;  the  leading  constant  term,  and  equilibrium  radius,  both  of  which  depoid  upon  the 
types  of  atoms  in  the  bonded  pair.  The  next  set  of  parameters  in  the  parameter  file  are 
those  associated  with  bond  angle  energy.  Again  there  are  two  parameto^;  the  first  is  the 
leading  constant  term,  and  the  second  is  the  equilibrium  bond  angle.  The  third  set  of 
parameters  are  those  associated  with  the  dihedrd  angle  eno’gy  term  of  the  energy 
function,  and  include  a  leading  constant  term  and  an  equilibrium  dihedral  angle.  The 
parameters  for  bond  and  dihedral  angles  depend  upon  the  types  and  the  order  of  the  three 
and  four  atoms,  respectively,  involved  in  the  interaction.  Because  of  the  combinatoric 
nature  of  ordered  relations,  many  of  the  dihedral  parameters  are  only  specified  by  specific 
second  and  third  atom  types,  in  order  to  reduce  the  size  of  the  parameter  file.  Generic 
atom  type  positions  are  identified  in  the  parameter  file  with  a  single  capital  "X"  placed  in 
the  field  normally  used  for  the  atom  type. 

The  next  set  of  parameters  listed  in  the  parameter  file,  after  the  dihedral 
parameters,  are  parameters  associated  with  improper  angle  energy.  The  improper  angle 
energy  is  another  component  of  the  energy  function  which  can  be  modeled  empirically; 
however,  this  term  is  not  included  in  this  research  implementation  because  of  its  relatively 
small  contribution,  and  also  to  facilitate  comparisons  to  other  published  results  that  did  not 
model  the  improper  angle  energy  term.  Consequently,  this  section  of  the  parameter  file  is 
skipped  to  reach  the  final  set  of  parameters  used;  the  parameters  associated  with  the  non- 
bonded  interaction  term  of  the  energy  function.  Unlike  the  other  parameters  so  far,  which 
correspond  directly  to  terms  within  the  energy  function,  the  non-bonded  section  of  the 
parameter  file  provides  the  values  for  and  Rmin  for  each  individual  atom  type.  The 
corresponding  parameter  values  for  the  non-bonded  term  of  the  energy  function  are 
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derived  from  the  following  relationships,  where  i  and  j  are  the  numbers  corre^nding  to 
the  pair  of  atoms  involved  in  the  non-bonded  interaction. 

^ij  ~  ^miny 

p  _  D  I  p 

Aj  = 

Locating  every  required  parameter  by  directly  searching  the  parameter  file  would 
require  an  extensive  amount  of  I/O  time.  Consequently,  in  order  to  minimize  the  amount 
of  I/O  used  to  read  the  data  from  the  parameter  file,  the  first  set  of  parameters  is  read  into 
an  array  structure,  which  is  then  used  during  the  search  process  to  locate  the  ^propriate 
parameters  according  to  the  atom  types  present.  Afrer  all  the  required  parameters 
associated  with  that  section  of  the  parameter  file  are  located  and  stored  in  the  group  list 
structure,  the  array  of  parameters  is  cleared  and  the  nect  set  of  parameters  are  read  in 
from  the  parameter  file.  This  process  continues  until  all  parameters  required  to  evaluate 
the  energy  function  are  stored  within  the  group  list  structure;  thus  allowing  the  parameter 
file  to  be  closed  since  it  is  no  longer  needed. 

4.1.3  Implementation  of  Empirical  Energy  Equation.  The  implementation  of 
the  empirical  energy  function  follows  a  functional  programming  paradigm  whore  each 
component  of  the  energy  function  is  implemented  as  a  separate  function.  This  facilitates  a 
summation  operation  which  is  performed  for  each  of  the  individual  groups  lists 
corresponding  to  the  separate  terms  of  the  energy  function,  as  well  as  allowing  analysis  of 
the  individual  contribution  of  each  component  of  the  energy  fijnction.  Since  all  the 
parameters  for  each  of  the  energy  terms  are  stored  in  the  group  list  structure,  the 
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functional  implementation  of  each  term  of  the  energy  function  can  be  coded  so  that  the 
constants  and  variables  are  passed  as  parameters  to  the  function.  The  following  pseudo 
code  demonstrates  the  functional  programming  format  used  for  the  individual  terms  of  the 
energy  fimction. 

energy_component (Kl,  K2/  x) 

{ 

contribution  =  f(Kl,  K2,  x) 
return (contribution) 

} 


The  constants  K1  and  K2  correspond  to  the  leading  coefficient  and  the  equilibrium 
constant  found  in  the  bond,  bond  angle,  and  dihedral  angle  energy  terms  of  the  energy 
function.  The  individual  contribution  of  this  bonded  interaction  is  equal  to  a  function  of 
Kl,  K2,  and  the  variable  x,  which  is  either  the  bond  length,  bond  angle  or  the  dihedral 
angle  corresponding  to  the  geometry  of  the  group  of  atoms  in  the  specific  conformation 
being  analyzed. 

The  summation  of  the  total  energy  is  accomplished  by  indexing  through  each  of  the 
group  lists,  calling  the  appropriate  energy  component  function  and  summing  over  all  terms 
of  the  energy  function.  The  following  pseudo  code  demonstrates  the  summation  of  the 
total  energy  over  each  component  of  the  energy  fiinction. 

total_energy  =  0.0 

for  i  =  1  to  nuinber_of_group_lists 

I 

indexPtr  =  group_list [i] 
energy_contribution  =  0.0 
while  (indexPtr  *  NULL) 

( 

energy_contribution  =  energy_contribution  + 

energy_coinponent (indexPtr->Kl,  indexPtr->K2,  indexPtr->x) 
indexPtr  =  indexPtr->next 

) 

total_energy  =  total_energy  +  energy_contribution 
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Although  the  above  pseudo  code  does  demonstrate  the  graeral  process  of  how  the 
energy  function  is  evaluated,  the  actual  implementation  differs  for  several  reasons.  First, 
the  function  called  to  evaluate  the  individual  Clergy  component  depoids  upon  the  type  of 
group  list  being  evaluated,  since  each  group  list  is  associated  with  a  ^)ecific  term  of  the 
energy  function.  Second,  the  source  of  the  variable  x,  also  dq)ends  upon  the  type  of 
group  list.  Fbced  variable  group  lists  retrieve  the  variable  x  from  the  original  data  stored  in 
the  array  of  records  structure  initialized  from  the  Z-matrix.  Independent  variable  group 
lists  acquire  a  candidate  solution  for  x  from  the  encoded  string  representation  of  the  GA. 
Finally,  dependent  variable  group  lists  must  call  an  intermediate  function  to  determine  the 
variable  x  from  the  Cartesian  coordinates  of  the  individual  atoms.  Similady,  the  evaluation 
of  the  non-bonded  energy  requires  the  intermediate  calculation  of  the  distance  between 
two  non-bonded  atoms,  in  order  to  calculate  the  individual  energy  contribution  of  the  pair 
of  atoms. 

4.1.4  Validation  of  the  Energy  Mod^.  The  process  used  to  validate  the  accuracy 
of  this  implementation  of  the  energy  model  involved  calculating  the  initial  energy  of  a  test 
molecule,  and  comparing  this  energy  to  that  which  is  obtained  using  the  CHARMm 
software  package  for  the  same  molecule.  Both  models  used  the  same  Z-matrix  from 
which  to  derive  the  geometric  information  relating  to  the  specific  conformation  of  the  test 
molecule.  The  energy  model  used  by  CHARMm  is  also  empirically  based,  using  the  same 
parameter  file  and  the  same  equations  for  the  individual  energy  components.  The 
CHARMm  model,  however,  does  model  other  energy  components  which  are  not  included 
in  the  GA  model,  such  as  the  improper  angle  energy.  CHARMm  does  have  the  ctq)ability 
to  report  the  individual  contributions  of  each  term  of  the  energy  function,  consequently 
allowing  a  direct  comparison  between  the  energies  found  with  the  GA  energy  model  and 
the  energies  determined  by  CHARMm.  Since  the  CHARMm  software  package  is  widely 
accepted  as  an  accurate  model  of  molecular  energy,  the  GA  model  was  considered  to  be 
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sufBdently  validated  whoi  its  calculated  energy  contributions  agreed  vdth  those  rqx»ted 
by  CHARMm  for  the  same  molecule. 

4.2  Integration  of  the  Energy  Model  Into  the  Genetic  Algorithm 

The  energy  model  serves  as  the  GA's  fitness  fimction  to  evaluate  the  fitness  of 
population  members,  thus  determining  their  survival  rate  into  the  next  generation.  The 
interfile  between  the  GA  and  the  energy  modd  is  through  the  fimction  called  eval,  which 
evaluates  the  fitness  of  each  member  of  the  population.  Considering  the  eno^  for  each 
population  member  must  be  calculated  for  each  generation,  several  steps  were  taken  to 
maximize  the  efifidoicy  of  the  energy  calculation.  Finally,  the  clean  interfiice  between  the 
GA  and  the  energy  modd  allowed  the  integration  of  the  model  into  sevotd  dififerent 
versions  of  the  GA 

4.2.1  Evatuadon  ef  Fitness.  With  each  genmuion,  the  fitness  of  each  population 
member  is  determined  by  calling  the  eval  fimction.  The  call  to  the  fur>^tion  eval  passes  a 
population  string  as  its  argument.  It  is  the  encoding  of  the  GA  string  ^ch  provides  the 
interfiu:e  between  the  energy  model  and  the  GA,  since  each  string  corresponds  to  a  unique 
spatial  conformation  of  the  molecule  being  modeled.  The  string  rq>reso)tation  is  decoded 
into  the  independent  variables  within  the  molecule,  such  as  bond  lengths,  bond  angles,  or 
dihedral  angles.  After  all  independent  variables  are  decoded,  and  in  conjunction  with  the 
structural  information  retained  fi’om  the  Z-matrix,  the  Cartesian  coordinates  of  each  atom 
can  be  determined.  The  calculation  of  the  molecular  energy  follows  by  indexing  through 
the  previously  defined  group  lists,  summing  up  all  of  the  individual  components  of  energy 
to  arrive  at  the  total  energy.  This  total  energy  corresponds  to  the  specific  conformation 
defined  by  the  string  representation  and  is  returned  as  a  value  of  fitness. 

The  GA  uses  the  returned  energy  value  directly  to  proportionally  allocate  strings 
into  the  next  generation.  In  order  to  evolve  solutions  which  minimize  the  energy  of  the 
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molecule,  the  GA  must  allocate  more  copies  in  future  generations  to  solutions  with  high 
fitness  levels  (low  energy  values).  Whh  each  succesave  generation,  the  solutions 
correspond  to  three  dimensional  conformations  with  lower  and  lower  merges.  After  a 
predefined  number  of  fitness  evaluations,  the  best  solution  found  so  far  is  taken  as  the  fiiud 
solution,  and  execution  is  hahed.  To  evaluate  the  quality  of  the  final  solution,  the 
corresponding  Cartesian  coordinates  of  each  of  the  atoms  are  printed  to  a  file  in  a 
particular  format  which  can  then  be  read  by  the  QUANTA  software  package  vdiich 
renders  an  image  of  the  molecule  on  the  screoi  for  examination. 

4.2.2  Efficiency  Consideradons.  Since  the  execution  of  the  genetic  algorithm  is 
dominated  be  the  time  required  to  evaluate  the  fitness  of  each  string  for  every  generation, 
every  consideration  should  be  given  to  maximi^g  the  efficiency  of  the  calculation  of  the 
molecular  energy.  One  way  to  accomplish  this  would  be  to  eliminate  any  unnecessarily 
redundant  calculations.  For  most  energy  minimization  problems,  only  a  subset  of  the 
dihedral  angles  are  allowed  to  vary  independently.  The  bond  lengths,  bond  angles,  and  the 
remaining  dihedral  angles  are  held  fixed.  Consequently,  the  energy  associated  with  these 
fixed  interactions  can  be  calculated  once  and  stored,  rather  than  recalculating  it  for  every 
evaluation.  This  was  accomplished  by  implementing  a  global  variable  called  fixed_energy, 
which  stores  the  sum  .f  all  the  fixed  energies  calculated  during  initialization.  The  total 
energy  is  for  each  evaluation  starts  with  the  fixed  energy,  and  then  proceeds  to  add  the 
contributions  fi'om  the  independently  variable  energy  terms,  the  dependency  variable 
terms,  and  the  non-bonded  interaction  term  which  is  dependent  on  the  global  structure. 

Another  efficiency  consideration  is  the  availability  of  ail  the  parameters  necessary 
to  evaluate  the  empiiicai  energy  function.  In  order  to  avoid  searching  for  the  parameters 
with  each  evaluation,  all  necessary  parameters  are  identified  during  the  initialization  phase 
and  stored  in  a  record  structure  of  the  atom  group  lists.  The  parameter  file  is  only 
searched  once,  and  all  subsequent  references  to  parameters  can  be  made  directly. 
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Consequently,  the  evaluation  of  the  total  energy  can  proceed  by  Aq>ping  through  each 
atom  group  list,  and  calling  the  appropriate  energy  conq)onait  function  with  the 
parametors  passed  as  arguments. 
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V.  Resutts  and  Evaluation 


In  order  to  evahiate  the  perfonnance  of  GA's  as  a  coofixmatioiial  analysis  tod,  the 
developed  potential  energy  model  is  int^rated  into  both  a  sequendal  and  a  paraUd 
implementation  of  a  simple  genetic  algorithm  based  on  (£NESIS  (Greftnstette.  1986), 
and  tested  on  a  real  protein.  The  purpose  of  the  experiment  was  two  fold;  first  to 
determine  \^iether  or  not  GA's  are  an  effective  seardi  tedmique  fix  minimiang  mdecular 
energies  fi^r  conformational  analysis  of  proteins,  and  also  to  assess  the  scalability  of  GA's 
on  parallel  conqxiter  architectures.  To  acccmq)lish  the  first  objective,  the  GA  is  shown  to 
eflfectivdy  minimize  the  energy  of  the  test  proUem.  The  second  objective  is  accomfdished 
by  demonstrating  near  linear  speedup  of  a  paralld  implementation  of  the  GA  with  ix> 
corre^nding  degradation  in  solution  quality. 

5.1  Sequential  Implafnentation  Test  Results 

S.1.I  Devdt^menl  of  Test  Set.  The  test  {Nddem  dx)sen  is  the  mdecule 
[Met]-Enkq>halin,  \diich  is  a  five  residue  pqrtide.  This  molecule  was  chosen  both  for  its 
small  size  and  because  it  has  a  known  conformational  structure,  including  a  beta  turn.  The 
bond  lengths  and  bond  angles  were  hdd  fixed,  while  the  (fihedral  angles  along  the 
backbone  (four  each  of  \|/,  and  o  angles)  and  9  sidediain  dihedral  angles  (x  angles)  of 
the  protein  where  allowed  to  vary  indq}endently,  for  a  total  of  21  indqieodent  variables. 
A  resolution  of  q>proximatdy  0.35  d^ree  inomnents  was  adueved  by  encoding  10  Irinary 
characters  for  each  independent  variable  that  resulted  in  a  total  string  length  of  210  bits  for 
each  monber  of  the  population  of  strings. 
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5.1.2  Experimental  Procedure.  The  sequential  impleinentation  is  peiformed  on  a 
Sparc  Station  II,  executing  in  the  background.  Because  of  the  long  execution  times,  up  to 
14  hours  per  run,  the  number  of  expoiments  performed  was  limited  to  ten  individual  runs. 
Various  population  sizes  and  mutation  rates  were  performed  in  order  to  gain  empirical 
support  for  choosing  appropriate  GA  parameters.  The  best  run  was  evaluated  on  the  basis 
of  final  solution  quality  and  also  upon  the  population  convergence  evidoiced  by  the 
degradation  in  population  variance. 

The  effect  of  population  size  on  convergence  is  demonstrated  in  Figures  5-8.  AVith 
a  relatively  small  population  size  of  SO,  the  GA  converges  to  a  single  solution  after  25,000 
fitness  evaluations.  The  convergence  is  more  evident  in  Figures  6  and  7  where  the 
variance  is  shown  to  decrease  toward  zero  and  the  average  population  fitness  becomes 
equal  to  the  best  fitness  found.  Increasing  the  population  size  provides  more  genetic 
material,  allowing  the  GA  to  continue  to  explore  the  conformational  search  space.  This 
continued  exploration  is  demonstrated  in  Fi^eS,  where  the  population  does  not 
converge  but  rather  the  algorithm  stops  who)  reaching  the  predefined  maximum  number 
of 500,000  fitness  evaluations. 
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Population  Fitness  I  I  Population  Fitness 


Figure  S.  Convergence  After  25,000  Evahiations 


Figure  6.  Convergoice  Afto- 120,000  Evahiations 
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Population  Fitness  I  I  Population  Fitness 


Population  Size  »  200 
Mutation  Rate  -  0.00001 
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Figure  7.  Convergence  Just  Before  Reaching  500,000  Evaluations 


Population  Size  =  500 
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Figure  8.  Maintains  Population  Diversity  Till  Reaching  the  Stopping  Criteria 
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The  effectiveness  of  the  GA  confonnational  search  in  terms  of  solution  quality  can 
be  seen  in  Figure  9,  which  plots  the  best  solution  found  so  far  during  execution  for  each  of 
the  various  population  sizes.  It  is  interesting  to  note  that  the  results  from  the  population 
size  of  200  outperforms  the  results  achieved  using  a  larger  population  size  of  SOO.  It  is 
possible,  however,  that  the  population  of  SOO  would  have  been  able  to  find  a  b^er 
solution,  had  it  been  allowed  to  continue  to  evolve  to  the  point  of  convergence.  The 
amount  of  time  allowed  for  a  GA  to  execute,  represents  the  tradeoff  between  exploration 
and  exploitation. 


Best  Performance  for  Various  Population  Sizes 
(Mutation  Rate  =  0.00C)01) 


Figure  9.  Relative  Performance  of  Various  Population  Sizes 


5.1.3  Evaluatwn  of  Performance.  For  the  sequential  implementation,  the 
resulting  conformation  could  be  directly  compared  to  the  accepted  known  conformational 
structure.  The  comparison  is  performed  by  comparing  the  corresponding  backbone  and 
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1 

1  Energies  I 

1  Bon 

Ang. 

EHh. 

U 

Elec. 

Total 

Initial 

5.2 

13.8 

■HOQ 

-35.5 

97 

Initial  Minimized 

2.4 

6.2 

-34.7 

-36 

BestGA 

1  0.4 

2.6 

6.1 

-36.7 

-39 

Table  1.  Potential  Energy  Values 


sidechain  dihedral  angles.  The  lowest  energy  structure  generated  from  the  GA  application 
has  been  fruther  minimized,  using  a  steepest  descent  conjugate  gradient  local  minimizer. 
This  resulting  structure  is  then  compared  to  the  global  energy  minimum  structure  of 
[Met]-Enkephalin  by  direct  comparison  of  the  backbone  and  x  dihedral  angles. 

Table  1  presents  the  energy  values  (kcal/mol)  for  the  initial  structure,  the  initial 
minimized  structure,  and  the  best  GA  result  (after  minimization).  Table  2  provides  a 
comparison  between  the  various  backbone  dihedral  angles  (degrees)  for  the  same  three 
methods,  plus  the  results  reported  by  Nayeem,  Vila,  and  Scheraga,  (1991). 

It  is  interesting  to  point  out  that  although  the  energy  has  not  changed  substantially 
for  using  the  GA  compared  to  locally  minimizing  the  initial  structure,  the  conformational 
space  has  been  searched.  Indeed,  for  the  locally  nunimized  structure  the  <)>,  V  angles 
remain  about  180  deg  (<|>:  mean  =  -164;  std  =  32;  xj/:  mean  =  170;  std  =  1 1),  while  for  GA 
structure  the  beta-bend  is,  in  part,  being  reproduced.  These  results  are  still  being  refined, 
and  also  note  that  in  Nayeem's  (1991)  article  a  relatively  wide  range  of  results  has  been 
previously  reported. 

5.2  Parallel  Implementation  Test  Results 

5.2.1  Development  of  Test  Problem.  Similar  to  the  test  problem  chosen  for  the 
sequential  implementation,  the  molecule  [Met]-Enkephalin,  was  again  chosen  for  its  small 
size  and  known  conformational  structure.  However,  note  that  in  this  case  the  non-polar 
hydrogen  atoms  were  not  explicitly  included.  All  bond  lengths  and  bond  angles  were 
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again  hdd  fixed,  while  the  same  dilwdral  angles  along  the  backboiM  and  sidechain  dihedral 
angles  of  the  protein  were  allowed  to  vary  independoitly.  The  same  angle  resolution  was 
also  used,  resulting  in  the  same  total  string  loigth  of  210  bits  for  each  member  of  the 
population  of  strings. 


Residue 
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180 
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90 
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Minimized 
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-179 
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62 

87 
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GA 
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-179 
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64 

89 
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Nayeem 
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-173 
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-166 
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Minimized 
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-134 

Gly 

GA 
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82 
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Gly 

Initial 
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Minimized 
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84 
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Nayeem 

84 
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-170 

Phe 

Initial 
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-58 

Phe 

Minimized 
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Initial 
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79 
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-158 

180 
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-177 
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Nayeem 

-164 

160 

-180 

53 

175 

-180 

-59 

Table  2.  Comparison  of  Dihedral  Angles  for  Various  Methods 

The  lowest  energy  structure  generated  fi-om  the  GA  application  has  been  further 
minimized,  using  a  steepest  descent  conjugate  gradient  local  minimizer.  This  resulting 
structure  is  then  compared  to  the  global  energy  minimum  structure  of  [Met]-Enkephalin 
by  direct  comparison  of  the  backbone  and  x  dihedral  angles. 
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5.Z2  Ej^eriHiental  Procedure.  Preliminary  testing  was  perfomMd  on  both  a 
serial  and  a  parallel  implementation  of  the  GA  in  order  to  gain  insight  into  the  proper  GA 
settings  for  mutation  rates  and  population  sizes  for  this  particular  application.  The 
mutation  operator  was  implemented  as  a  bitwise  probability;  therefore,  in  order  to  achieve 
a  desired  probability  of  changing  a  string  structure,  the  bitwise  mutation  rate  must 
determined  by  the  following  relationship.  The  bitwise  mutation  rate  is  designated  and 
the  probability  of  mutation  for  a  string  structure  is  designated  as  Pa/.  where  the  length  of 
the  string  is  /.  The  mutation  rate  chosen  for  this  experiment  was  0.00001,  which 
corresponds  to  a  0.21%  probability  of  mutating  a  string  structure. 

Pa/' 1-0 -/>».)' 

The  GA  was  executed  on  an  Intel  iPSC/860  with  64  processor  elements.  In  order 
to  be  able  to  scale  up  to  a  total  of  32  processors,  a  total  population  size  of 640  was  chosen 
such  that  the  subpopulation  size  on  each  node  would  never  become  less  than  20 
individuals  per  processor.  This  ensured  th^  there  would  be  sufiBcient  genetic  material  to 
perform  useful  search  from  local  crossover  and  mutation,  m  order  to  evaluate  the 
scalability  of  the  GA,  the  test  problem  was  executed  on  various  partition  sizes  ranging 
from  1  to  32  nodes,  by  powers  of  2. 

5.2.3  Evaluation  of  Performance.  Both  execution  time  and  solution  quality  were 
averaged  over  S  runs  for  each  partition  size.  The  results  show  that  the  GA  achieved  near 
linear  speedup  as  the  number  of  processors  was  increased  (Figure  10).  In  addition,  there 
was  no  corresponding  loss  of  solution  quality  as  the  subpopulation  size  on  each  node  was 
reduced  by  a  factor  of  n/p,  where  n  is  the  total  population  size  and  p  is  the  number  of 
processors  (Figure  1 1). 
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Figure  10.  Near  Linear  Speedup 


Figure  11.  Relative  Performance  of  Parallel  GA 
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VI.  Conclusions  and  Future  Research 


The  genetic  algorithm  ofim  many  potential  advantages  over  other  existing 
conformational  analytis  techniques.  It  has  hem  demonstrated  that  a  GA  implementation, 
using  an  empirical  energy  model  as  the  fitness  fimction,  is  able  to  optimize  independently 
variable  dihedral  angles  to  minimize  tlw  energy  of  the  peptide  [Met]-Enkq)halin. 
Although  this  test  problem  is  relatively  small,  optimizing  over  only  21  iiulq)etul«itly 
varyitig  dihedral  angles,  the  GA  is  expected  to  scale  well  to  laiga*  problems.  Also, 
conformational  analysis  on  an  entire  molecule  is  not  always  necessary.  Often,  only  small 
segrnems  of  a  large  protein  may  be  of  interest  in  ordo*  to  predict  local  secondary  structure 
formations. 

GA's  are  probabilistic  algorithms  and  thus  can  not  guarantee  the  globally  minimal 
solution.  However,  combining  the  GA  with  more  traditional  local  optimization  techniques 
may  prove  to  be  a  useful  conformational  analytis  technique.  The  GA  could  be  used  to 
explore  large  search  spaces  to  locate  areas  of  potrmtial  interest,  and  then  a  simple  hill¬ 
climbing  algorithm  could  be  used  to  perform  local  optimizations  in  those  areas.  This 
approach  holds  particular  promise  in  the  application  of  conformational  analysis,  where  the 
GA  could  be  used  to  locate  a  population  of  good  candidate  solutions  which  biochemists 
could  then  use  as  a  starting  point  to  apply  more  domain  specific  knowledge  in  order  to 
optimize  the  candidate  solutions. 

One  potentially  limiting  fiictor  to  the  scalability  of  a  GA  applied  to  the  proton 
folding  problem  is  the  minimum  number  of  population  members  necessary  on  each 
proce!»or  in  order  to  perform  useful  search  through  crossover.  The  results  reported 
previously  scale  very  well  up  to  32  processors;  however,  increasing  the  number  of 
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processors  would  reduce  the  subpopulation  ^  and  eventually  may  lead  to  problems  of 
pronature  convergence  as  the  subpopulation  size  decreases  to  an  insufficioit  level. 
Consequently,  as  larger  and  larger  parallel  platforms  become  available,  the  question  arises 
as  to  what  is  a  sufficient  subpopulation  size  for  performing  local  crossover.  This  question 
should  be  addressed  through  additional  research  with  larger  proteins  and  larger  platforms. 
Since  the  evaluation  function  is  of  0(n^,  where  n  is  the  number  of  atoms  in  the  molecule, 
it  may  be  worthwhile  to  investigate  the  use  of  a  combined  approach  of  local  and  global 
crossover  in  order  to  scale  up  to  a  larger  numbers  of  processors  without  correspondingly 
increasing  the  total  population  size. 

Guidance  for  future  efforts  should  emphasize; 

•  Dynamically  controlled  parameters 

•  Improved  parallel  communications  strategies 

•  Optimization  of  code  for  evaluation  fimction 

•  Scale  algorithm  to  larger  parallel  platforms 

•  Scale  application  to  larger  problem  sizes 

Future  efforts  should  be  directed  towards  improving  the  effectiveness  of  the  GA  as 
a  function  optimizer.  The  performance  of  GAs  has  been  shown  to  be  sensitive  to 
parameter  settings.  The  implementation  of  dynamically  controlled  genetic  operators  offer 
the  potential  of  fine  tuning  the  parameter  settings  of  the  GA  during  execution  rather  than 
being  held  fixed.  Continued  efforts  to  optimize  the  code  should  be  made  in  order  to 
reduce  execution  time.  Refinements  to  the  parallel  decomposition  and  communications 
strategies  used  in  the  parallel  implementation  of  the  genetic  algorithm  should  also  decrease 
execution  time,  allowing  the  GA  to  be  applied  to  larger  problem  sizes.  Other  parallel 
decomposition  strategies  should  be  investigated  when  the  number  of  parallel  processors 
approaches  the  total  population  size.  It  may  become  necessary  to  develop  a  hybrid  type 
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island  model  where  a  subpopulation  exists  on  a  partition  of  the  processors  rather  than  on 
an  individual  processor. 
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